* [PATCH RFC v2] Fix race window during idle cpu selection.
From: Atish Patra @ 2017-12-05 19:09 UTC
  To: linux-kernel; +Cc: joelaf, jbacik, mingo, peterz, efault, urezki, atish.patra

This patch tries to fix a possible race window that lets multiple
tasks wake up on the same cpu. schbench, uperf & Jackbench (Android)
mostly showed improvements, while none of the other standard
benchmarks showed any regression.

Changes from v1->v2:
1. Added the performance numbers.
2. Did not include the debug patch.
   It can still be found at 
   https://lkml.org/lkml/2017/10/31/16

Atish Patra (1):
  sched: Minimize the idle cpu selection race window.

 kernel/sched/core.c      |  4 ++++
 kernel/sched/fair.c      | 12 +++++++++---
 kernel/sched/idle_task.c |  1 +
 kernel/sched/sched.h     |  1 +
 4 files changed, 15 insertions(+), 3 deletions(-)

-- 
2.7.4


* [PATCH RFC v2] sched: Minimize the idle cpu selection race window.
From: Atish Patra @ 2017-12-05 19:09 UTC
  To: linux-kernel; +Cc: joelaf, jbacik, mingo, peterz, efault, urezki, atish.patra

Currently, multiple tasks can wake up on the same cpu via the
select_idle_sibling() path if they wake up simultaneously and last
ran on the same LLC. This happens because an idle cpu is not marked
busy until its idle task is scheduled out, so any task waking during
that period may select that cpu as its wakeup candidate.

Introduce a per-cpu variable that is set as soon as a cpu is selected
for a wakeup. This prevents other tasks from selecting the same cpu
again. Note: this does not close the race window entirely, but narrows
it to the access of the per-cpu variable. If two wakee tasks read the
per-cpu variable at the same time, they may still select the same cpu.
But it shrinks the race window considerably.
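
To make the residual window concrete, below is a minimal userspace
sketch of the claim protocol (illustration only, not part of the
patch; C11 atomics stand in for the per-cpu accesses, and the names
try_claim_racy/try_claim_atomic/cpu_went_idle are made up). The
separate load and store leave room for two wakers to both observe the
flag as clear; a compare-and-exchange would close the window
completely, at the cost of an atomic RMW on the wakeup fast path.

/* Illustration only: C11 userspace model of the per-cpu claim flag. */
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int claim_wakeup;	/* stands in for per_cpu(claim_wakeup) */

/*
 * What the patch does, modelled as a separate load and store: two
 * wakers can both see 0 before either stores 1 -- the residual race
 * described above.
 */
static bool try_claim_racy(void)
{
	if (atomic_load_explicit(&claim_wakeup, memory_order_relaxed))
		return false;			/* cpu already claimed */
	atomic_store_explicit(&claim_wakeup, 1, memory_order_relaxed);
	return true;
}

/*
 * Hypothetical fully closed variant: compare-and-exchange guarantees
 * exactly one winner, but puts an atomic RMW on the wakeup fast path.
 */
static bool try_claim_atomic(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&claim_wakeup, &expected, 1);
}

/* Cleared when the cpu really goes idle, as pick_next_task_idle() does. */
static void cpu_went_idle(void)
{
	atomic_store_explicit(&claim_wakeup, 0, memory_order_relaxed);
}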

Here are some performance numbers:

Hardware config: 20-core (40 hyperthreaded cpus) x86 box.
uperf config: ping-pong test on loopback interface with message size = 8k

           Baseline (4.14)          Baseline + pcpu
Threads    Mean       stdev      Mean         stdev    Improvement(%)
1          9.056      0.02       8.966        0.083    -0.993
2          17.664     0.13       17.448       0.303    -1.222
4          32.03      0.22       31.972       0.129    -0.181
8          58.198     0.31       58.588       0.198     0.670
16         101.018    0.67       100.056      0.455    -0.952
32         148.1      15.41      164.494      2.312    11.069
64         203.66     1.16       203.042      1.348    -0.3073
128        197.12     1.04       194.722      1.174    -1.2165

schbench config: message threads = 2; time = 180s; worker threads = 19

        Baseline (4.14)           Baseline + pcpu
Max latency (usec)
        Mean     stdev        Mean      stdev    Improvement(%)
        16457    2046.51      12985.4   48.46    21.095

Latency percentiles in usec (lower is better)
        Mean     stdev        Mean      stdev    Improvement(%)
50%     63       4.774        58        4        7.936
75%     81.6     1.2          79.4      0.8      2.696
90%     92.8     0.979        90.6      0.489    2.370
95%     102.6    1.624        99        0.894    3.508
99%     126.25   4.573        120.4     3.872    4.63
99.5%   2712.2   4772.883     133.6     5.571    95.074

Reported by Joel:
Jackbench (Android)
count    16304   (@60 fps, 4.5 minutes)

        Without patch   With patch
mean         5.196633   4.429641 (+14.75%)
std          2.030054   2.310025
25%          5.606810   1.991017 (+64.48%)
50%          5.824013   5.716631 (+1.84%)
75%          5.987102   5.932751 (+0.90%)
95%          6.461230   6.301318 (+2.47%)
99%          9.828959   9.697076 (+1.34%)

The details are at:
https://lkml.org/lkml/2017/11/22/5

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
 kernel/sched/core.c      |  4 ++++
 kernel/sched/fair.c      | 12 +++++++++---
 kernel/sched/idle_task.c |  1 +
 kernel/sched/sched.h     |  1 +
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2288a14..d9d501c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3896,6 +3896,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3917,6 +3918,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13393bb..885023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,8 +6077,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 				idle = false;
 		}
 
-		if (idle)
+		if (idle) {
+			per_cpu(claim_wakeup, core) = 1;
 			return core;
+		}
 	}
 
 	/*
@@ -6102,8 +6104,10 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
+		if (idle_cpu(cpu)) {
+			per_cpu(claim_wakeup, cpu) = 1;
 			return cpu;
+		}
 	}
 
 	return -1;
@@ -6165,8 +6169,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
+		if (idle_cpu(cpu)) {
+			per_cpu(claim_wakeup, cpu) = 1;
 			break;
+		}
 	}
 
 	time = local_clock() - time;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..64d6495 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8aa24b4..5f70b98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,6 +1068,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;
-- 
2.7.4


* Re: [PATCH RFC v2] sched: Minimize the idle cpu selection race window.
From: Matt Fleming @ 2018-02-07 13:59 UTC
  To: Atish Patra; +Cc: linux-kernel, joelaf, jbacik, mingo, peterz, efault, urezki

On Tue, 05 Dec, at 01:09:07PM, Atish Patra wrote:
> Currently, multiple tasks can wake up on the same cpu via the
> select_idle_sibling() path if they wake up simultaneously and last
> ran on the same LLC. This happens because an idle cpu is not marked
> busy until its idle task is scheduled out, so any task waking during
> that period may select that cpu as its wakeup candidate.
> 
> Introduce a per-cpu variable that is set as soon as a cpu is selected
> for a wakeup. This prevents other tasks from selecting the same cpu
> again. Note: this does not close the race window entirely, but narrows
> it to the access of the per-cpu variable. If two wakee tasks read the
> per-cpu variable at the same time, they may still select the same cpu.
> But it shrinks the race window considerably.
> 
> Here are some performance numbers:

I ran this patch through some tests here on the SUSE performance grid
and there's a definite regression for Mike's personal favourite
benchmark, tbench.

Here are the results: vanilla 4.15-rc9 on the left, -rc9 plus this
patch on the right.

tbench4
                                 4.15.0-rc9                 4.15.0-rc9
                                    vanilla    sched-minimize-idle-cpu-window
Min       mb/sec-1        484.50 (   0.00%)      463.03 (  -4.43%)
Min       mb/sec-2        961.43 (   0.00%)      959.35 (  -0.22%)
Min       mb/sec-4       1789.60 (   0.00%)     1760.21 (  -1.64%)
Min       mb/sec-8       3518.51 (   0.00%)     3471.47 (  -1.34%)
Min       mb/sec-16      5521.12 (   0.00%)     5409.77 (  -2.02%)
Min       mb/sec-32      7268.61 (   0.00%)     7491.29 (   3.06%)
Min       mb/sec-64     14413.45 (   0.00%)    14347.69 (  -0.46%)
Min       mb/sec-128    13501.84 (   0.00%)    13413.82 (  -0.65%)
Min       mb/sec-192    13237.02 (   0.00%)    13231.43 (  -0.04%)
Hmean     mb/sec-1        505.20 (   0.00%)      485.81 (  -3.84%)
Hmean     mb/sec-2        973.12 (   0.00%)      970.67 (  -0.25%)
Hmean     mb/sec-4       1835.22 (   0.00%)     1788.54 (  -2.54%)
Hmean     mb/sec-8       3529.35 (   0.00%)     3487.20 (  -1.19%)
Hmean     mb/sec-16      5531.16 (   0.00%)     5437.43 (  -1.69%)
Hmean     mb/sec-32      7627.96 (   0.00%)     8021.26 (   5.16%)
Hmean     mb/sec-64     14441.20 (   0.00%)    14395.08 (  -0.32%)
Hmean     mb/sec-128    13620.40 (   0.00%)    13569.17 (  -0.38%)
Hmean     mb/sec-192    13265.26 (   0.00%)    13263.98 (  -0.01%)
Max       mb/sec-1        510.30 (   0.00%)      489.89 (  -4.00%)
Max       mb/sec-2        989.45 (   0.00%)      976.10 (  -1.35%)
Max       mb/sec-4       1845.65 (   0.00%)     1795.50 (  -2.72%)
Max       mb/sec-8       3574.03 (   0.00%)     3547.56 (  -0.74%)
Max       mb/sec-16      5556.99 (   0.00%)     5564.80 (   0.14%)
Max       mb/sec-32      7678.18 (   0.00%)     8098.63 (   5.48%)
Max       mb/sec-64     14463.07 (   0.00%)    14437.58 (  -0.18%)
Max       mb/sec-128    13659.67 (   0.00%)    13602.65 (  -0.42%)
Max       mb/sec-192    13612.01 (   0.00%)    13832.98 (   1.62%)

There's a nice little performance bump around the 32-client mark.
Incidentally, my test machine has 2 NUMA nodes with 24 cpus (12 cores,
2 threads) each. So 32 clients is the point at which things no longer
fit on a single node.

It doesn't look like the regression is caused by the schedule() path
being slightly longer (i.e. it's not a latency issue), because the
schbench results show improvements at the low end:

schbench
                                 4.15.0-rc9                 4.15.0-rc9
                                    vanilla    sched-minimize-idle-cpu-window
Lat 50.00th-qrtle-1        46.00 (   0.00%)       36.00 (  21.74%)
Lat 75.00th-qrtle-1        49.00 (   0.00%)       37.00 (  24.49%)
Lat 90.00th-qrtle-1        52.00 (   0.00%)       38.00 (  26.92%)
Lat 95.00th-qrtle-1        56.00 (   0.00%)       41.00 (  26.79%)
Lat 99.00th-qrtle-1        61.00 (   0.00%)       46.00 (  24.59%)
Lat 99.50th-qrtle-1        63.00 (   0.00%)       48.00 (  23.81%)
Lat 99.90th-qrtle-1        77.00 (   0.00%)       64.00 (  16.88%)
Lat 50.00th-qrtle-2        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-2        47.00 (   0.00%)       46.00 (   2.13%)
Lat 90.00th-qrtle-2        50.00 (   0.00%)       49.00 (   2.00%)
Lat 95.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
Lat 99.00th-qrtle-2        58.00 (   0.00%)       57.00 (   1.72%)
Lat 99.50th-qrtle-2        60.00 (   0.00%)       59.00 (   1.67%)
Lat 99.90th-qrtle-2        72.00 (   0.00%)       69.00 (   4.17%)
Lat 50.00th-qrtle-4        46.00 (   0.00%)       45.00 (   2.17%)
Lat 75.00th-qrtle-4        49.00 (   0.00%)       48.00 (   2.04%)
Lat 90.00th-qrtle-4        52.00 (   0.00%)       51.00 (   1.92%)
Lat 95.00th-qrtle-4        55.00 (   0.00%)       53.00 (   3.64%)
Lat 99.00th-qrtle-4        61.00 (   0.00%)       59.00 (   3.28%)
Lat 99.50th-qrtle-4        63.00 (   0.00%)       61.00 (   3.17%)
Lat 99.90th-qrtle-4        69.00 (   0.00%)       74.00 (  -7.25%)
Lat 50.00th-qrtle-8        48.00 (   0.00%)       50.00 (  -4.17%)
Lat 75.00th-qrtle-8        52.00 (   0.00%)       54.00 (  -3.85%)
Lat 90.00th-qrtle-8        54.00 (   0.00%)       58.00 (  -7.41%)
Lat 95.00th-qrtle-8        57.00 (   0.00%)       61.00 (  -7.02%)
Lat 99.00th-qrtle-8        64.00 (   0.00%)       68.00 (  -6.25%)
Lat 99.50th-qrtle-8        67.00 (   0.00%)       72.00 (  -7.46%)
Lat 99.90th-qrtle-8        81.00 (   0.00%)       81.00 (   0.00%)
Lat 50.00th-qrtle-16       50.00 (   0.00%)       47.00 (   6.00%)
Lat 75.00th-qrtle-16       59.00 (   0.00%)       57.00 (   3.39%)
Lat 90.00th-qrtle-16       66.00 (   0.00%)       65.00 (   1.52%)
Lat 95.00th-qrtle-16       69.00 (   0.00%)       68.00 (   1.45%)
Lat 99.00th-qrtle-16       76.00 (   0.00%)       75.00 (   1.32%)
Lat 99.50th-qrtle-16       79.00 (   0.00%)       79.00 (   0.00%)
Lat 99.90th-qrtle-16       86.00 (   0.00%)       89.00 (  -3.49%)
Lat 50.00th-qrtle-23       52.00 (   0.00%)       52.00 (   0.00%)
Lat 75.00th-qrtle-23       65.00 (   0.00%)       65.00 (   0.00%)
Lat 90.00th-qrtle-23       75.00 (   0.00%)       74.00 (   1.33%)
Lat 95.00th-qrtle-23       81.00 (   0.00%)       79.00 (   2.47%)
Lat 99.00th-qrtle-23       95.00 (   0.00%)       90.00 (   5.26%)
Lat 99.50th-qrtle-23    12624.00 (   0.00%)     1050.00 (  91.68%)
Lat 99.90th-qrtle-23    15184.00 (   0.00%)    13872.00 (   8.64%)

If you'd like to run these tests on your own machines, they're all
available at https://github.com/gormanm/mmtests.git.

