* [RFC 0/4] IDLE gating in presence of latency-sensitive tasks
@ 2020-05-07 13:37 Parth Shah
  2020-05-07 13:37 ` [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks Parth Shah
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Parth Shah @ 2020-05-07 13:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, qais.yousef,
	chris.hyser, pkondeti, valentin.schneider, rjw

Abstract:
=========
The IDLE states provide a way to save power at the cost of wakeup latency.
The deeper the IDLE state selected by the CPUIDLE governor, the higher the
exit latency of that state. This exit latency adds to the task's wakeup
latency. Hence, choosing the best trade-off between the power savings and
the increase in observable latency is what the CPUIDLE governor focuses on.

The CPUIDLE governor is generic in nature and tries to provide the best of
both worlds. However, it does not have the capability to distinguish the
latency sensitivity of the tasks queued on the runqueue.

With the introduction of the latency_nice attribute, which conveys latency
requirements from userspace, the CPUIDLE governor can use it to identify
latency-sensitive tasks.

Hence, this patch-set prevents a CPU running latency-sensitive tasks from
entering any IDLE state, thus minimizing the impact of exit latency on such
tasks. The expected result is better power savings and comparable performance
when compared to disabling all IDLE states.

- Why not use PM_QoS?
PM_QoS provides ways to assist the CPUIDLE governor's decision with
per-device (CPU, network, etc.) attributes. Because this assistance works at
the "per device" level, it cannot provide both power savings and latency
minimization for specific tasks only. Hence, a PM_QoS-like feature can be
combined with the scheduler to retrieve the latency_nice information of a
task and make a better decision at the per-task level. This gives better
performance only on those CPUs which have latency-sensitive tasks queued on
them, while saving power on the other CPUs.


Implementation:
===============
- Latency-sensitive tasks are the tasks marked with latency_nice == -20.
- Use a per-CPU counter to keep track of the (to be) queued latency-sensitive
  tasks.
- The CPUIDLE governor does not choose a non-polling idle state on such
  marked CPUs until the per-CPU counter goes back to zero.

This strategy solves many latency-related problems for tasks showing a
sleep-wake-sleep pattern (basically most GPU workloads, schbench, rt-app,
database workloads, and many more).

This series is based on latency_nice patch-set
PATCH v5 https://lkml.org/lkml/2020/2/28/166

One may use the below tools to set the latency_nice value:
- lnice.c: lnice -l -20 <workload>
  https://github.com/parthsl/tools/blob/master/misc_scripts/lnice.c
or
- relnice.c: relnice -p <PID> -l 19
  https://github.com/parthsl/tools/blob/master/misc_scripts/relnice.c
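
These helpers boil down to a sched_setattr() call carrying the latency_nice
field added by the prerequisite series. A minimal sketch of the syscall path
(the struct layout and the SCHED_FLAG_LATENCY_NICE value are assumptions
here; take them from that series' uapi headers):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed flag value */

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* utilization clamps */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	/* added by the latency_nice series */
	int32_t  sched_latency_nice;
};

int main(void)
{
	struct sched_attr attr = {
		.size			= sizeof(attr),
		.sched_flags		= SCHED_FLAG_LATENCY_NICE,
		.sched_latency_nice	= -20,	/* most latency sensitive */
	};

	/* pid 0 == the calling task */
	if (syscall(__NR_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}

A real tool would likely also pass SCHED_FLAG_KEEP_POLICY and
SCHED_FLAG_KEEP_PARAMS so that the task's existing policy and scheduling
parameters are preserved.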


Results:
========
# Baseline = tip/sched/core + latency_nice patches
# w/ patch = Baseline + this patch-set + workload's latency_nice = -20

=> Schbench (Lower is better)
-----------------------------
- Baseline:
$> schbench -r 30
+---------------------+----------+-------------------------+-----------+
| %ile Latency (in us)| Baseline | cpupower idle-set -D 10 | w/ patch  |
+---------------------+----------+-------------------------+-----------+
| 50                  |      371 | 21                      | 22        |
| 90                  |      729 | 33                      | 37        |
| 99                  |      889 | 40 (-95%)               | 48 (-94%) |
| max                 |     2197 | 59                      | 655       |
| Avg. Energy (Watts) |       73 | 96 (+31%)               | 88 (+20%) |
+---------------------+----------+-------------------------+-----------+

$> schbench -r 10 -m 2 -t 1
+---------------------+----------+-------------------------+-----------+
| %ile Latency (in us)| Baseline | cpupower idle-set -D 10 | w/ patch  |
+---------------------+----------+-------------------------+-----------+
| 50                  |      336 | 5                       | 4         |
| 90                  |      464 | 7                       | 6         |
| 99                  |      579 | 10 (-98%)               | 11 (-98%) |
| max                 |      691 | 12                      | 489       |
| Avg. Energy (Watts) |       28 | 40 (+42%)               | 33 (+17%) |
+---------------------+----------+-------------------------+-----------+


=> PostgreSQL (lower is better):
----------------------------------
- 44 Clients running in parallel
$> pgbench -T 30 -S -n -R 10  -c 44
+---------------------+----------+-------------------------+--------------+
|                     | Baseline | cpupower idle-set -D 10 |   w/ patch   |
+---------------------+----------+-------------------------+--------------+
| latency avg. (ms)   |    2.028 | 0.424 (-80%)            | 1.202 (-40%) |
| latency stddev      |    3.149 | 0.473                   | 0.234        |
| trans. completed    |      294 | 304 (+3%)               | 300 (+2%)    |
| Avg. Energy (Watts) |     23.6 | 42.5 (+80%)             | 26.5 (+20%)  |
+---------------------+----------+-------------------------+--------------+

- 1 Client running
$> pgbench -T 30 -S -n -R 10 -c 1
+---------------------+----------+-------------------------+--------------+
|                     | Baseline | cpupower idle-set -D 10 |   w/ patch   |
+---------------------+----------+-------------------------+--------------+
| latency avg. (ms)   |    1.292 | 0.282 (-78%)            | 0.237 (-81%) |
| latency stddev      |    0.572 | 0.126                   | 0.116        |
| trans. completed    |      294 | 268 (-8%)               | 315 (+7%)    |
| Avg. Energy (Watts) |      9.8 | 29.6 (+302%)            | 27.7 (+282%) |
+---------------------+----------+-------------------------+--------------+
*trans. completed = Total transactions processed (Higher is better)


Parth Shah (4):
  sched/core: Introduce per_cpu counter to track latency sensitive tasks
  sched/core: Set nr_lat_sensitive counter at various scheduler
    entry/exit points
  sched/idle: Disable idle call on least latency requirements
  sched/idle: Add debugging bits to validate inconsistency in latency
    sensitive task calculations

 kernel/sched/core.c  | 32 ++++++++++++++++++++++++++++++--
 kernel/sched/idle.c  |  8 +++++++-
 kernel/sched/sched.h |  7 +++++++
 3 files changed, 44 insertions(+), 3 deletions(-)

-- 
2.17.2



* [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks
  2020-05-07 13:37 [RFC 0/4] IDLE gating in presence of latency-sensitive tasks Parth Shah
@ 2020-05-07 13:37 ` Parth Shah
  2020-05-08  8:40   ` Pavan Kondeti
  2020-05-07 13:37 ` [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points Parth Shah
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Parth Shah @ 2020-05-07 13:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, qais.yousef,
	chris.hyser, pkondeti, valentin.schneider, rjw

The "nr_lat_sensitive" per_cpu variable provides hints on the possible
number of latency-sensitive tasks occupying the CPU. This hints further
helps in inhibiting the CPUIDLE governor from calling deeper IDLE states
(next patches includes this).

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2576fd8cacf9..2d8b76f41d61 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6606,6 +6606,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(int, nr_lat_sensitive);
 
 void __init sched_init(void)
 {
@@ -6737,6 +6738,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		per_cpu(nr_lat_sensitive, i) = 0;
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2c86dfe913e..5c41020c530e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1439,6 +1439,8 @@ DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
+DECLARE_PER_CPU(int, nr_lat_sensitive);
+
 extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
-- 
2.17.2



* [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points
  2020-05-07 13:37 [RFC 0/4] IDLE gating in presence of latency-sensitive tasks Parth Shah
  2020-05-07 13:37 ` [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks Parth Shah
@ 2020-05-07 13:37 ` Parth Shah
  2020-05-08  8:33   ` Pavan Kondeti
  2020-05-07 13:37 ` [RFC 3/4] sched/idle: Disable idle call on least latency requirements Parth Shah
  2020-05-07 13:37 ` [RFC 4/4] sched/idle: Add debugging bits to validate inconsistency in latency sensitive task calculations Parth Shah
  3 siblings, 1 reply; 15+ messages in thread
From: Parth Shah @ 2020-05-07 13:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, qais.yousef,
	chris.hyser, pkondeti, valentin.schneider, rjw

Monitor tasks at:
1. wake_up_new_task() - forked tasks

2. set_task_cpu() - task migrations, load balancer

3. __sched_setscheduler() - setting/unsetting the latency_nice value
Increment the nr_lat_sensitive count on the CPU for a task marked with
latency_nice == -20.
Similarly, decrement the nr_lat_sensitive counter when the task is re-marked
with a latency_nice value > -20.

4. finish_task_switch() - dying tasks

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  5 +++++
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8b76f41d61..ad396c36eba6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		if (task_is_lat_sensitive(p)) {
+			per_cpu(nr_lat_sensitive, task_cpu(p))--;
+			per_cpu(nr_lat_sensitive, new_cpu)++;
+		}
+
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
@@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
+	int target_cpu = 0;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	p->state = TASK_RUNNING;
@@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
+	__set_task_cpu(p, target_cpu);
+
 #endif
 	rq = __task_rq_lock(p, &rf);
+
+#ifdef CONFIG_SMP
+	if (task_is_lat_sensitive(p))
+		per_cpu(nr_lat_sensitive, target_cpu)++;
+#endif
+
 	update_rq_clock(rq);
 	post_init_entity_util_avg(p);
 
@@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
+		if (task_is_lat_sensitive(prev))
+			per_cpu(nr_lat_sensitive, prev->cpu)--;
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
@@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
 
-	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (p->state != TASK_DEAD &&
+		    attr->sched_latency_nice != p->latency_nice) {
+			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
+				per_cpu(nr_lat_sensitive, task_cpu(p))++;
+			else if (task_is_lat_sensitive(p))
+				per_cpu(nr_lat_sensitive, task_cpu(p))--;
+		}
+
 		p->latency_nice = attr->sched_latency_nice;
+	}
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c41020c530e..56f885e37451 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
+static inline int task_is_lat_sensitive(struct task_struct *p)
+{
+	return p->latency_nice == MIN_LATENCY_NICE;
+}
+
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
-- 
2.17.2



* [RFC 3/4] sched/idle: Disable idle call on least latency requirements
  2020-05-07 13:37 [RFC 0/4] IDLE gating in presence of latency-sensitive tasks Parth Shah
  2020-05-07 13:37 ` [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks Parth Shah
  2020-05-07 13:37 ` [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points Parth Shah
@ 2020-05-07 13:37 ` Parth Shah
  2020-05-08  8:36   ` Pavan Kondeti
  2020-05-07 13:37 ` [RFC 4/4] sched/idle: Add debugging bits to validate inconsistency in latency sensitive task calculations Parth Shah
  3 siblings, 1 reply; 15+ messages in thread
From: Parth Shah @ 2020-05-07 13:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, qais.yousef,
	chris.hyser, pkondeti, valentin.schneider, rjw

Restrict the call to deeper idle states when the given CPU has been marked
as having the least-latency requirement.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/idle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..85d72a6e2521 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -262,7 +262,8 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
+		    per_cpu(nr_lat_sensitive, cpu)) {
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
-- 
2.17.2



* [RFC 4/4] sched/idle: Add debugging bits to validate inconsistency in latency sensitive task calculations
  2020-05-07 13:37 [RFC 0/4] IDLE gating in presence of latency-sensitive tasks Parth Shah
                   ` (2 preceding siblings ...)
  2020-05-07 13:37 ` [RFC 3/4] sched/idle: Disable idle call on least latency requirements Parth Shah
@ 2020-05-07 13:37 ` Parth Shah
  3 siblings, 0 replies; 15+ messages in thread
From: Parth Shah @ 2020-05-07 13:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, qais.yousef,
	chris.hyser, pkondeti, valentin.schneider, rjw

We monitor tasks entering/exiting the scheduler, but there might be
unhandled situations which can lead to an inconsistent value of the
nr_lat_sensitive counter. This may lead to restricting the use of IDLE states
despite the absence of any latency-sensitive workload.
Hence, print a pr_info() if a negative value of the nr_lat_sensitive counter
is found.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 kernel/sched/idle.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 85d72a6e2521..7aa0775e69c0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -231,6 +231,11 @@ static void cpuidle_idle_call(void)
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	int pm_disabled = per_cpu(nr_lat_sensitive, cpu);
+
+	if (pm_disabled < 0)
+		pr_info("Inconsistent value of nr_lat_sensitive counter\n");
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
@@ -263,7 +268,7 @@ static void do_idle(void)
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
 		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
-		    per_cpu(nr_lat_sensitive, cpu)) {
+		    pm_disabled) {
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
-- 
2.17.2



* Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points
  2020-05-07 13:37 ` [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points Parth Shah
@ 2020-05-08  8:33   ` Pavan Kondeti
  2020-05-08 11:15     ` Parth Shah
  0 siblings, 1 reply; 15+ messages in thread
From: Pavan Kondeti @ 2020-05-08  8:33 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

Hi Parth,

On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> Monitor tasks at:
> 1. wake_up_new_task() - forked tasks
> 
> 2. set_task_cpu() - task migrations, Load balancer
> 
> 3. __sched_setscheduler() - set/unset latency_nice value
> Increment the nr_lat_sensitive count on the CPU with task marked with
> latency_nice == -20.
> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> with >-20 latency_nice task.
> 
> 4. finish_task_switch() - dying task
> 


> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> ---
>  kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |  5 +++++
>  2 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8b76f41d61..ad396c36eba6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  	trace_sched_migrate_task(p, new_cpu);
>  
>  	if (task_cpu(p) != new_cpu) {
> +		if (task_is_lat_sensitive(p)) {
> +			per_cpu(nr_lat_sensitive, task_cpu(p))--;
> +			per_cpu(nr_lat_sensitive, new_cpu)++;
> +		}
> +

Since we can come here without rq locks, there is a possibility
of a race and incorrect updates can happen. Since the counters
are used to prevent C-states, we don't want that to happen.

>  		if (p->sched_class->migrate_task_rq)
>  			p->sched_class->migrate_task_rq(p, new_cpu);
>  		p->se.nr_migrations++;
> @@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
>  {
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	int target_cpu = 0;
>  
>  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>  	p->state = TASK_RUNNING;
> @@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
>  	 * as we're not fully set-up yet.
>  	 */
>  	p->recent_used_cpu = task_cpu(p);
> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> +	target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
> +	__set_task_cpu(p, target_cpu);
> +

The target_cpu variable can be eliminated by using task_cpu(p) directly
in the below update.

>  #endif
>  	rq = __task_rq_lock(p, &rf);
> +
> +#ifdef CONFIG_SMP
> +	if (task_is_lat_sensitive(p))
> +		per_cpu(nr_lat_sensitive, target_cpu)++;
> +#endif
> +

Is the SMP check intentional? In some parts of this patch, updates to
nr_lat_sensitive are done without SMP checks. For example,
finish_task_switch() below.

>  	update_rq_clock(rq);
>  	post_init_entity_util_avg(p);
>  
> @@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
>  
> +		if (task_is_lat_sensitive(prev))
> +			per_cpu(nr_lat_sensitive, prev->cpu)--;
> +
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>  	p->normal_prio = normal_prio(p);
>  	set_load_weight(p, true);
>  
> -	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> +	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> +		if (p->state != TASK_DEAD &&
> +		    attr->sched_latency_nice != p->latency_nice) {
> +			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> +				per_cpu(nr_lat_sensitive, task_cpu(p))++;
> +			else if (task_is_lat_sensitive(p))
> +				per_cpu(nr_lat_sensitive, task_cpu(p))--;
> +		}
> +
>  		p->latency_nice = attr->sched_latency_nice;
> +	}
>  }

There is a potential race here due to which we can mess up the refcount.

- A latency sensitive task is marked TASK_DEAD
<snip>
- sched_setattr() called on the task to clear the latency nice. Since
we check the task state here, we skip the decrement.
- The task is finally context switched out and we skip the decrement again
since it is not a latency sensitive task.

>  
>  /* Actually do priority change: must hold pi & rq lock. */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5c41020c530e..56f885e37451 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
>  	return dl_policy(p->policy);
>  }
>  
> +static inline int task_is_lat_sensitive(struct task_struct *p)
> +{
> +	return p->latency_nice == MIN_LATENCY_NICE;
> +}
> +
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>  
>  /*
> -- 
> 2.17.2
> 

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements
  2020-05-07 13:37 ` [RFC 3/4] sched/idle: Disable idle call on least latency requirements Parth Shah
@ 2020-05-08  8:36   ` Pavan Kondeti
  2020-05-08 11:19     ` Parth Shah
  0 siblings, 1 reply; 15+ messages in thread
From: Pavan Kondeti @ 2020-05-08  8:36 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
> Restrict the call to deeper idle states when the given CPU has been set for
> the least latency requirements
> 
> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> ---
>  kernel/sched/idle.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..85d72a6e2521 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -262,7 +262,8 @@ static void do_idle(void)
>  		 * broadcast device expired for us, we don't want to go deep
>  		 * idle as we know that the IPI is going to arrive right away.
>  		 */
> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> +		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
> +		    per_cpu(nr_lat_sensitive, cpu)) {
>  			tick_nohz_idle_restart_tick();
>  			cpu_idle_poll();
>  		} else {
> -- 
> 2.17.2
> 

Since nr_lat_sensitive updates can happen remotely (when a latency-sensitive
task becomes a non-latency-sensitive one), we may need to add this condition
in cpu_idle_poll() as well.
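
Something along these lines, assuming the current polling loop condition in
cpu_idle_poll() (untested):

 	while (!tif_need_resched() &&
-		(cpu_idle_force_poll || tick_check_broadcast_expired()))
+		(cpu_idle_force_poll || tick_check_broadcast_expired() ||
+		 per_cpu(nr_lat_sensitive, smp_processor_id())))
 		cpu_relax();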

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks
  2020-05-07 13:37 ` [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks Parth Shah
@ 2020-05-08  8:40   ` Pavan Kondeti
  2020-05-08 11:30     ` Parth Shah
  0 siblings, 1 reply; 15+ messages in thread
From: Pavan Kondeti @ 2020-05-08  8:40 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

On Thu, May 07, 2020 at 07:07:20PM +0530, Parth Shah wrote:
> The "nr_lat_sensitive" per_cpu variable provides hints on the possible
> number of latency-sensitive tasks occupying the CPU. This hints further
> helps in inhibiting the CPUIDLE governor from calling deeper IDLE states
> (next patches includes this).
> 

Can you please explain the intended use case here? Once a latency-sensitive
task is created, it prevents C-states on a CPU whether or not the task runs
again in the near future.

I assume either these latency-sensitive tasks won't be around for a long
time, or applications set/reset the latency nice value dynamically.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points
  2020-05-08  8:33   ` Pavan Kondeti
@ 2020-05-08 11:15     ` Parth Shah
  2020-05-09  2:39       ` Pavan Kondeti
  0 siblings, 1 reply; 15+ messages in thread
From: Parth Shah @ 2020-05-08 11:15 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

Hi Pavan,

Thanks for going through this patch-set.

On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> Hi Parth,
> 
> On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
>> Monitor tasks at:
>> 1. wake_up_new_task() - forked tasks
>>
>> 2. set_task_cpu() - task migrations, Load balancer
>>
>> 3. __sched_setscheduler() - set/unset latency_nice value
>> Increment the nr_lat_sensitive count on the CPU with task marked with
>> latency_nice == -20.
>> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
>> with >-20 latency_nice task.
>>
>> 4. finish_task_switch() - dying task
>>
> 
> 
>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>> ---
>>  kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |  5 +++++
>>  2 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2d8b76f41d61..ad396c36eba6 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>  	trace_sched_migrate_task(p, new_cpu);
>>  
>>  	if (task_cpu(p) != new_cpu) {
>> +		if (task_is_lat_sensitive(p)) {
>> +			per_cpu(nr_lat_sensitive, task_cpu(p))--;
>> +			per_cpu(nr_lat_sensitive, new_cpu)++;
>> +		}
>> +
> 
> Since we can come here without rq locks, there is a possibility
> of a race and incorrect updates can happen. Since the counters
> are used to prevent C-states, we don't want that to happen.

I did try using task_lock(p) wherever we change the refcount and when the
latency_nice value is set. There I was using nr_lat_sensitive with the
atomic_t type.

After lots of thinking about how to optimize it, and given that we anyway
hold the rq lock, I thought of not using any lock here and seeing if the
scheduler community has a well-known solution for this :-)

But in brief, using an atomic_t nr_lat_sensitive and task_lock(p) when
changing the refcount should solve the problem, right?

If you or anyone else has a solution for this kind of pattern, that would
surely be helpful.

> 
>>  		if (p->sched_class->migrate_task_rq)
>>  			p->sched_class->migrate_task_rq(p, new_cpu);
>>  		p->se.nr_migrations++;
>> @@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
>>  {
>>  	struct rq_flags rf;
>>  	struct rq *rq;
>> +	int target_cpu = 0;
>>  
>>  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>>  	p->state = TASK_RUNNING;
>> @@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
>>  	 * as we're not fully set-up yet.
>>  	 */
>>  	p->recent_used_cpu = task_cpu(p);
>> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> +	target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
>> +	__set_task_cpu(p, target_cpu);
>> +
> 
> The target_cpu variable can be eliminated by using task_cpu(p) directly
> in the below update.

Right. Will change it thus saving one diff line.

> 
>>  #endif
>>  	rq = __task_rq_lock(p, &rf);
>> +
>> +#ifdef CONFIG_SMP
>> +	if (task_is_lat_sensitive(p))
>> +		per_cpu(nr_lat_sensitive, target_cpu)++;
>> +#endif
>> +
> 
> Is the SMP check intentional? In some parts of this patch, updates to
> nr_lat_sensitive are done without SMP checks. For example,
> finish_task_switch() below.

No, I just forgot to remove it. I will remove the SMP checks in the next
revision.

> 
>>  	update_rq_clock(rq);
>>  	post_init_entity_util_avg(p);
>>  
>> @@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>>  		if (prev->sched_class->task_dead)
>>  			prev->sched_class->task_dead(prev);
>>  
>> +		if (task_is_lat_sensitive(prev))
>> +			per_cpu(nr_lat_sensitive, prev->cpu)--;
>> +
>>  		/*
>>  		 * Remove function-return probe instances associated with this
>>  		 * task and put them back on the free list.
>> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>>  	p->normal_prio = normal_prio(p);
>>  	set_load_weight(p, true);
>>  
>> -	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
>> +	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
>> +		if (p->state != TASK_DEAD &&
>> +		    attr->sched_latency_nice != p->latency_nice) {
>> +			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
>> +				per_cpu(nr_lat_sensitive, task_cpu(p))++;
>> +			else if (task_is_lat_sensitive(p))
>> +				per_cpu(nr_lat_sensitive, task_cpu(p))--;
>> +		}
>> +
>>  		p->latency_nice = attr->sched_latency_nice;
>> +	}
>>  }
> 
> There is a potential race here due to which we can mess up the refcount.
> 
> - A latency sensitive task is marked TASK_DEAD
> <snip>
> - sched_setattr() called on the task to clear the latency nice. Since
> we check the task state here, we skip the decrement.
> - The task is finally context switched out and we skip the decrement again
> since it is not a latency senstivie task.

If the task is already marked TASK_DEAD, then we should have already
decremented its refcount in finish_task_switch().
Am I missing something?

> 
>>  
>>  /* Actually do priority change: must hold pi & rq lock. */
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 5c41020c530e..56f885e37451 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
>>  	return dl_policy(p->policy);
>>  }
>>  
>> +static inline int task_is_lat_sensitive(struct task_struct *p)
>> +{
>> +	return p->latency_nice == MIN_LATENCY_NICE;
>> +}
>> +
>>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>>  
>>  /*
>> -- 
>> 2.17.2
>>
> 
> Thanks,
> Pavan
> 

Thanks,
Parth


* Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements
  2020-05-08  8:36   ` Pavan Kondeti
@ 2020-05-08 11:19     ` Parth Shah
  2020-05-09  2:18       ` Pavan Kondeti
  0 siblings, 1 reply; 15+ messages in thread
From: Parth Shah @ 2020-05-08 11:19 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

Hi Pavan,

On 5/8/20 2:06 PM, Pavan Kondeti wrote:
> On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
>> Restrict the call to deeper idle states when the given CPU has been set for
>> the least latency requirements
>>
>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>> ---
>>  kernel/sched/idle.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index b743bf38f08f..85d72a6e2521 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -262,7 +262,8 @@ static void do_idle(void)
>>  		 * broadcast device expired for us, we don't want to go deep
>>  		 * idle as we know that the IPI is going to arrive right away.
>>  		 */
>> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> +		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
>> +		    per_cpu(nr_lat_sensitive, cpu)) {
>>  			tick_nohz_idle_restart_tick();
>>  			cpu_idle_poll();
>>  		} else {
>> -- 
>> 2.17.2
>>
> 
> Since nr_lat_sensitive updates can happen remotely (when a latency sensitive
> task becomes non-latency sensitive task), we may need to add this condition
> in cpu_idle_poll() as well.
> 

Right, but if the CPU is running the idle task then it will again come back
to do_idle() and read the refcount. It's a power-saving penalty for one
do_idle() loop, but it is difficult to put up checks for every change in the
latency_nice value.

Thanks,
Parth


* Re: [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks
  2020-05-08  8:40   ` Pavan Kondeti
@ 2020-05-08 11:30     ` Parth Shah
  2020-05-09  2:14       ` Pavan Kondeti
  0 siblings, 1 reply; 15+ messages in thread
From: Parth Shah @ 2020-05-08 11:30 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw



On 5/8/20 2:10 PM, Pavan Kondeti wrote:
> On Thu, May 07, 2020 at 07:07:20PM +0530, Parth Shah wrote:
>> The "nr_lat_sensitive" per_cpu variable provides hints on the possible
>> number of latency-sensitive tasks occupying the CPU. This hints further
>> helps in inhibiting the CPUIDLE governor from calling deeper IDLE states
>> (next patches includes this).
>>
> 
> Can you please explain the intended use case here? Once a latency sensitive
> task is created, it prevents c-state on a CPU whether the task runs again
> or not in the near future.
> 
> I assume, either these latency sensitive tasks won't be around for long time
> or applications set/reset latency sensitive nice value dynamically.
> 

The intended use case is to get rid of the IDLE states' exit_latency for
workloads with a wakeup-sleep-wakeup pattern. These types of tasks (like GPU
workloads and a few DB benchmarks) make the CPU go IDLE due to their low
runtime on the rq, resulting in higher wakeup latency due to the IDLE states'
exit_latency.

And this kind of workload may last for a long time as well.

In the current scenario, sysadmins disable all IDLE states or use PM_QoS to
avoid the latency penalty on the workload. This model was good when core
counts were low. But now, higher core counts and turbo frequencies make it
necessary to save power in order to get higher performance, and hence this
patch-set tries to do a PM_QoS-like thing but at per-task granularity.

If the idea seems good to go, then this can potentially be extended to do
IDLE gating up to a certain level, where the latency_nice value hints at
which IDLE states can't be chosen, just like PM_QoS has cpu_dma_latency
constraints.
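
As a purely hypothetical illustration of that direction (not part of this
series, and the numbers are made up), the strictest latency_nice value queued
on a CPU could be translated into an exit-latency cap for the governor to
compare against each state's exit_latency:

/*
 * Hypothetical mapping, for illustration only: returns an exit-latency
 * cap (in us) that a cpuidle governor could check against
 * drv->states[i].exit_latency, similar to a cpu_dma_latency constraint.
 */
static unsigned int lat_nice_to_exit_latency_cap(int latency_nice)
{
	if (latency_nice <= -20)
		return 0;		/* polling only, as in this series */
	if (latency_nice < 0)
		return (latency_nice + 20) * 10;  /* e.g. -19 => 10us */
	return UINT_MAX;			  /* no constraint */
}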


Thanks,
Parth


> Thanks,
> Pavan
> 


* Re: [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks
  2020-05-08 11:30     ` Parth Shah
@ 2020-05-09  2:14       ` Pavan Kondeti
  0 siblings, 0 replies; 15+ messages in thread
From: Pavan Kondeti @ 2020-05-09  2:14 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

On Fri, May 08, 2020 at 05:00:44PM +0530, Parth Shah wrote:
> 
> 
> On 5/8/20 2:10 PM, Pavan Kondeti wrote:
> > On Thu, May 07, 2020 at 07:07:20PM +0530, Parth Shah wrote:
> >> The "nr_lat_sensitive" per_cpu variable provides hints on the possible
> >> number of latency-sensitive tasks occupying the CPU. This hints further
> >> helps in inhibiting the CPUIDLE governor from calling deeper IDLE states
> >> (next patches includes this).
> >>
> > 
> > Can you please explain the intended use case here? Once a latency sensitive
> > task is created, it prevents c-state on a CPU whether the task runs again
> > or not in the near future.
> > 
> > I assume, either these latency sensitive tasks won't be around for long time
> > or applications set/reset latency sensitive nice value dynamically.
> > 
> 
> Intended use-cases is to get rid of IDLE states exit_latency for
> wakeup-sleep-wakeup pattern workload. This types of tasks (like GPU
> workloads, few DB benchmarks) makes CPU go IDLE due to its low runtime on
> rq, resulting in higher wakeups due to IDLE states exit_latency.
> 
> And this kind of workloads may last for long time as well.
> 
> In current scenario, Sysadmins do disable all IDLE states or use PM_QoS to
> not have latency penalty on workload. This model was good when core counts
> were less. But now higher core count and Turbo frequencies have led to save
> power in-order to get higher performance and hence this patch-set tries to
> do PM_QoS like thing but at per-task granularity.
> 
Thanks for the details. Instead of disabling C-states for all CPUs, we
disable them for CPUs that host latency-sensitive tasks. Since this is
hooked into the scheduler, task migrations are accounted for. Got it.

Thanks,
Pavan

> If idea seems good to go, then this can potentially be extended to do IDLE
> gating upto certain level where latency_nice value hints on which IDLE
> states can't be chosen, just like PM_QoS have cpu_dma_latency constraints.
> 
> 
> Thanks,
> Parth
> 
> 
> > Thanks,
> > Pavan
> > 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements
  2020-05-08 11:19     ` Parth Shah
@ 2020-05-09  2:18       ` Pavan Kondeti
  0 siblings, 0 replies; 15+ messages in thread
From: Pavan Kondeti @ 2020-05-09  2:18 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

On Fri, May 08, 2020 at 04:49:04PM +0530, Parth Shah wrote:
> Hi Pavan,
> 
> On 5/8/20 2:06 PM, Pavan Kondeti wrote:
> > On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
> >> Restrict the call to deeper idle states when the given CPU has been set for
> >> the least latency requirements
> >>
> >> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> >> ---
> >>  kernel/sched/idle.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index b743bf38f08f..85d72a6e2521 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -262,7 +262,8 @@ static void do_idle(void)
> >>  		 * broadcast device expired for us, we don't want to go deep
> >>  		 * idle as we know that the IPI is going to arrive right away.
> >>  		 */
> >> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> >> +		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
> >> +		    per_cpu(nr_lat_sensitive, cpu)) {
> >>  			tick_nohz_idle_restart_tick();
> >>  			cpu_idle_poll();
> >>  		} else {
> >> -- 
> >> 2.17.2
> >>
> > 
> > Since nr_lat_sensitive updates can happen remotely (when a latency sensitive
> > task becomes non-latency sensitive task), we may need to add this condition
> > in cpu_idle_poll() as well.
> > 
> 
> Right. but if the CPU is running idle_task then it will again come back to
> do_idle and read the refcount. Its a penalty in power-saving for 1
> do_idle() loop but it is difficult to put up checks for any change in
> latency_nice value.
> 

Yes, when the CPU exits to serve an interrupt, we notice the change in the
latency-sensitive task count and enter a C-state.

I asked this because all checks that are present here are also present in
cpu_idle_poll().

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points
  2020-05-08 11:15     ` Parth Shah
@ 2020-05-09  2:39       ` Pavan Kondeti
  2020-05-12  7:51         ` Parth Shah
  0 siblings, 1 reply; 15+ messages in thread
From: Pavan Kondeti @ 2020-05-09  2:39 UTC (permalink / raw)
  To: Parth Shah
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw

On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
> Hi Pavan,
> 
> Thanks for going through this patch-set.
> 
> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> > Hi Parth,
> > 
> > On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> >> Monitor tasks at:
> >> 1. wake_up_new_task() - forked tasks
> >>
> >> 2. set_task_cpu() - task migrations, Load balancer
> >>
> >> 3. __sched_setscheduler() - set/unset latency_nice value
> >> Increment the nr_lat_sensitive count on the CPU with task marked with
> >> latency_nice == -20.
> >> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> >> with >-20 latency_nice task.
> >>
> >> 4. finish_task_switch() - dying task
> >>
> > 
> > 
> >> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> >> ---
> >>  kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
> >>  kernel/sched/sched.h |  5 +++++
> >>  2 files changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8b76f41d61..ad396c36eba6 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> >>  	trace_sched_migrate_task(p, new_cpu);
> >>  
> >>  	if (task_cpu(p) != new_cpu) {
> >> +		if (task_is_lat_sensitive(p)) {
> >> +			per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> +			per_cpu(nr_lat_sensitive, new_cpu)++;
> >> +		}
> >> +
> > 
> > Since we can come here without rq locks, there is a possibility
> > of a race and incorrect updates can happen. Since the counters
> > are used to prevent C-states, we don't want that to happen.
> 
> I did tried using task_lock(p) wherever we do change refcount and when
> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
> type.
> 
> After lots of thinking to optimize it and thinking that we anyways hold rq
> lock, I thought of not using any lock here and see if scheduler community
> has well known solution for this :-)
> 
> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changin
> refcount should solve problem, right?
> 
> If you or anyone have solution for this kind of pattern, then that surely
> will be helpful.
> 
I am not sure if task_lock() can help here, because we are operating on the
counter on a per-CPU basis. Maybe a cmpxchg-based loop works here to make
sure that the increment/decrement operation happens atomically.
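
For the increment side, assuming the counter stays a plain per-CPU int, such
a loop could look like (untested):

	int *ptr = per_cpu_ptr(&nr_lat_sensitive, cpu);
	int old;

	/* lockless increment; retry if another updater slipped in */
	do {
		old = READ_ONCE(*ptr);
	} while (cmpxchg(ptr, old, old + 1) != old);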

> > 
> >>  		if (p->sched_class->migrate_task_rq)
> >>  			p->sched_class->migrate_task_rq(p, new_cpu);
> >>  		p->se.nr_migrations++;

[...]

> >> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
> >>  	p->normal_prio = normal_prio(p);
> >>  	set_load_weight(p, true);
> >>  
> >> -	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> >> +	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> >> +		if (p->state != TASK_DEAD &&
> >> +		    attr->sched_latency_nice != p->latency_nice) {
> >> +			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> >> +				per_cpu(nr_lat_sensitive, task_cpu(p))++;
> >> +			else if (task_is_lat_sensitive(p))
> >> +				per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> +		}
> >> +
> >>  		p->latency_nice = attr->sched_latency_nice;
> >> +	}
> >>  }
> > 
> > There is a potential race here due to which we can mess up the refcount.
> > 
> > - A latency sensitive task is marked TASK_DEAD
> > <snip>
> > - sched_setattr() called on the task to clear the latency nice. Since
> > we check the task state here, we skip the decrement.
> > - The task is finally context switched out and we skip the decrement again
> > since it is not a latency senstivie task.
> 
> if task is already marked TASK_DEAD then we should have already decremented
> its refcount in finish_task_switch().
> am I missing something?

There is a window (context switch and dropping the rq lock) between marking
a task DEAD (in do_task_dead()) and dropping the ref counter (in
finish_task_switch()) during which we can get here and skip the decrement
because the task is marked as DEAD.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points
  2020-05-09  2:39       ` Pavan Kondeti
@ 2020-05-12  7:51         ` Parth Shah
  0 siblings, 0 replies; 15+ messages in thread
From: Parth Shah @ 2020-05-12  7:51 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: linux-kernel, peterz, mingo, vincent.guittot, dietmar.eggemann,
	qais.yousef, chris.hyser, valentin.schneider, rjw



On 5/9/20 8:09 AM, Pavan Kondeti wrote:
> On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
>> Hi Pavan,
>>
>> Thanks for going through this patch-set.
>>
>> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
>>> Hi Parth,
>>>
>>> On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
>>>> Monitor tasks at:
>>>> 1. wake_up_new_task() - forked tasks
>>>>
>>>> 2. set_task_cpu() - task migrations, Load balancer
>>>>
>>>> 3. __sched_setscheduler() - set/unset latency_nice value
>>>> Increment the nr_lat_sensitive count on the CPU with task marked with
>>>> latency_nice == -20.
>>>> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
>>>> with >-20 latency_nice task.
>>>>
>>>> 4. finish_task_switch() - dying task
>>>>
>>>
>>>
>>>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>>>> ---
>>>>  kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
>>>>  kernel/sched/sched.h |  5 +++++
>>>>  2 files changed, 33 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8b76f41d61..ad396c36eba6 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>>>  	trace_sched_migrate_task(p, new_cpu);
>>>>  
>>>>  	if (task_cpu(p) != new_cpu) {
>>>> +		if (task_is_lat_sensitive(p)) {
>>>> +			per_cpu(nr_lat_sensitive, task_cpu(p))--;
>>>> +			per_cpu(nr_lat_sensitive, new_cpu)++;
>>>> +		}
>>>> +
>>>
>>> Since we can come here without rq locks, there is a possibility
>>> of a race and incorrect updates can happen. Since the counters
>>> are used to prevent C-states, we don't want that to happen.
>>
>> I did tried using task_lock(p) wherever we do change refcount and when
>> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
>> type.
>>
>> After lots of thinking to optimize it and thinking that we anyways hold rq
>> lock, I thought of not using any lock here and see if scheduler community
>> has well known solution for this :-)
>>
>> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changin
>> refcount should solve problem, right?
>>
>> If you or anyone have solution for this kind of pattern, then that surely
>> will be helpful.
>>
> I am not sure if task_lock() can help here, because we are operating the
> counter on per CPU basis here. May be cmpxchg based loop works here to make
> sure that increment/decrement operation happens atomically here.
> 
>>>
>>>>  		if (p->sched_class->migrate_task_rq)
>>>>  			p->sched_class->migrate_task_rq(p, new_cpu);
>>>>  		p->se.nr_migrations++;
> 
> [...]
> 
>>>> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>>>>  	p->normal_prio = normal_prio(p);
>>>>  	set_load_weight(p, true);
>>>>  
>>>> -	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
>>>> +	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
>>>> +		if (p->state != TASK_DEAD &&
>>>> +		    attr->sched_latency_nice != p->latency_nice) {
>>>> +			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
>>>> +				per_cpu(nr_lat_sensitive, task_cpu(p))++;
>>>> +			else if (task_is_lat_sensitive(p))
>>>> +				per_cpu(nr_lat_sensitive, task_cpu(p))--;
>>>> +		}
>>>> +
>>>>  		p->latency_nice = attr->sched_latency_nice;
>>>> +	}
>>>>  }
>>>
>>> There is a potential race here due to which we can mess up the refcount.
>>>
>>> - A latency sensitive task is marked TASK_DEAD
>>> <snip>
>>> - sched_setattr() called on the task to clear the latency nice. Since
>>> we check the task state here, we skip the decrement.
>>> - The task is finally context switched out and we skip the decrement again
>>> since it is not a latency senstivie task.
>>
>> if task is already marked TASK_DEAD then we should have already decremented
>> its refcount in finish_task_switch().
>> am I missing something?
> 
> There is a window (context switch and dropping rq lock) between
> marking a task DEAD (in do_task_dead()) and dropping the ref counter
> (in finish_task_switch()) during which we can run into here and skip
> the checking because task is marked as DEAD.
> 

Yeah, TASK_DEAD seems to be a genuine race condition. At every other place
we do hold task_rq_lock(), except when the task is dying. There is a window
between do_task_dead() and finish_task_switch() which may create a race
condition, so if marking TASK_DEAD were protected under task_rq_lock() then
this could be prevented. I will have to look at the code more thoroughly and
figure out a way to protect the refcount under such circumstances.


Thanks,
Parth


end of thread, other threads:[~2020-05-12  7:52 UTC | newest]

Thread overview: 15+ messages
2020-05-07 13:37 [RFC 0/4] IDLE gating in presence of latency-sensitive tasks Parth Shah
2020-05-07 13:37 ` [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks Parth Shah
2020-05-08  8:40   ` Pavan Kondeti
2020-05-08 11:30     ` Parth Shah
2020-05-09  2:14       ` Pavan Kondeti
2020-05-07 13:37 ` [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points Parth Shah
2020-05-08  8:33   ` Pavan Kondeti
2020-05-08 11:15     ` Parth Shah
2020-05-09  2:39       ` Pavan Kondeti
2020-05-12  7:51         ` Parth Shah
2020-05-07 13:37 ` [RFC 3/4] sched/idle: Disable idle call on least latency requirements Parth Shah
2020-05-08  8:36   ` Pavan Kondeti
2020-05-08 11:19     ` Parth Shah
2020-05-09  2:18       ` Pavan Kondeti
2020-05-07 13:37 ` [RFC 4/4] sched/idle: Add debugging bits to validate inconsistency in latency sensitive task calculations Parth Shah
