* [PATCH RFC 0/2] Fix race window during idle cpu selection.
@ 2017-10-31 5:27 Atish Patra
2017-10-31 5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
2017-10-31 5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra
0 siblings, 2 replies; 25+ messages in thread
From: Atish Patra @ 2017-10-31 5:27 UTC (permalink / raw)
To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo
The first patch fixes the issue. The second patch adds a new field
to schedstat, intended only for testing.
Atish Patra (2):
sched: Minimize the idle cpu selection race window.
sched: Add a stat for idle cpu selection race window.
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 12 +++++++++---
kernel/sched/idle_task.c | 1 +
kernel/sched/sched.h | 2 ++
kernel/sched/stats.c | 5 +++--
5 files changed, 21 insertions(+), 5 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-10-31 5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
@ 2017-10-31 5:27 ` Atish Patra
2017-10-31 8:20 ` Peter Zijlstra
2017-10-31 5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra
1 sibling, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-10-31 5:27 UTC (permalink / raw)
To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo
Currently, multiple tasks can wake up on the same cpu from
the select_idle_sibling() path in case they wake up simultaneously
and last ran on the same llc. This happens because an idle cpu
is not updated until the idle task is scheduled out. Any task waking
during that period may potentially select that cpu as a wakeup
candidate.
Introduce a per-cpu variable that is set as soon as a cpu is
selected for wakeup for any task. This prevents other tasks
from selecting the same cpu again. Note: this does not close the
race window but narrows it to the access of the per-cpu variable.
If two wakee tasks access the per-cpu variable at the same time,
they may still select the same cpu, but the race window is reduced
considerably.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
kernel/sched/core.c | 4 ++++
kernel/sched/fair.c | 12 +++++++++---
kernel/sched/idle_task.c | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2288a14..d9d501c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3896,6 +3896,7 @@ int task_prio(const struct task_struct *p)
return p->prio - MAX_RT_PRIO;
}
+DEFINE_PER_CPU(int, claim_wakeup);
/**
* idle_cpu - is a given CPU idle currently?
* @cpu: the processor in question.
@@ -3917,6 +3918,9 @@ int idle_cpu(int cpu)
return 0;
#endif
+ if (per_cpu(claim_wakeup, cpu))
+ return 0;
+
return 1;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13393bb..885023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,8 +6077,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
idle = false;
}
- if (idle)
+ if (idle) {
+ per_cpu(claim_wakeup, core) = 1;
return core;
+ }
}
/*
@@ -6102,8 +6104,10 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu(cpu, cpu_smt_mask(target)) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
+ if (idle_cpu(cpu)) {
+ per_cpu(claim_wakeup, cpu) = 1;
return cpu;
+ }
}
return -1;
@@ -6165,8 +6169,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
+ if (idle_cpu(cpu)) {
+ per_cpu(claim_wakeup, cpu) = 1;
break;
+ }
}
time = local_clock() - time;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..64d6495 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
put_prev_task(rq, prev);
update_idle_core(rq);
+ this_cpu_write(claim_wakeup, 0);
schedstat_inc(rq->sched_goidle);
return rq->idle;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8aa24b4..5f70b98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,6 +1068,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
struct sched_group_capacity {
atomic_t ref;
--
2.7.4
* [PATCH DEBUG 2/2] sched: Add a stat for idle cpu selection race window.
2017-10-31 5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
2017-10-31 5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
@ 2017-10-31 5:27 ` Atish Patra
1 sibling, 0 replies; 25+ messages in thread
From: Atish Patra @ 2017-10-31 5:27 UTC (permalink / raw)
To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo
This is a **debug-only** patch, not intended for merging.
A new stat is added to schedstat that counts the number of
times a cpu had already been claimed during wakeup while some other
cpu tried to schedule a task on it again. It helps verify
whether the concerned issue is present in a specific benchmark.
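With the debug patch applied, the new counter lands as the trailing field of each per-cpu line in /proc/schedstat (field position assumed from the seq_printf change in the diff). A quick way to eyeball it, shown here against a sample line so it is self-contained:

```shell
# Sample /proc/schedstat cpu line as the patched kernel would print it;
# the trailing field is the new ttwu_claimed counter (layout assumed
# from the patch, 10 numeric fields after the cpu name).
line="cpu0 10 0 120 30 400 200 123456 7890 55 42"
echo "$line" | awk '{ print $1, "ttwu_claimed:", $NF }'
```

On a live patched system, `awk '/^cpu/ { print $1, $NF }' /proc/schedstat` would show the per-cpu counts directly.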
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
kernel/sched/core.c | 4 +++-
kernel/sched/sched.h | 1 +
kernel/sched/stats.c | 5 +++--
3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9d501c..d12b8c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3918,8 +3918,10 @@ int idle_cpu(int cpu)
return 0;
#endif
- if (per_cpu(claim_wakeup, cpu))
+ if (per_cpu(claim_wakeup, cpu)) {
+ schedstat_inc(rq->ttwu_claimed);
return 0;
+ }
return 1;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f70b98..8f3047c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -801,6 +801,7 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+ unsigned int ttwu_claimed;
#endif
#ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 87e2c9f..2290aa7 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -30,12 +30,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay,
+ rq->rq_sched_info.pcount, rq->ttwu_claimed);
seq_printf(seq, "\n");
--
2.7.4
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-10-31 5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
@ 2017-10-31 8:20 ` Peter Zijlstra
2017-10-31 8:48 ` Mike Galbraith
2017-11-05 0:58 ` Joel Fernandes
0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-10-31 8:20 UTC (permalink / raw)
To: Atish Patra; +Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> Currently, multiple tasks can wake up on the same cpu from
> the select_idle_sibling() path in case they wake up simultaneously
> and last ran on the same llc. This happens because an idle cpu
> is not updated until the idle task is scheduled out. Any task waking
> during that period may potentially select that cpu as a wakeup
> candidate.
>
> Introduce a per-cpu variable that is set as soon as a cpu is
> selected for wakeup for any task. This prevents other tasks
> from selecting the same cpu again. Note: this does not close the
> race window but narrows it to the access of the per-cpu variable.
> If two wakee tasks access the per-cpu variable at the same time,
> they may still select the same cpu, but the race window is reduced
> considerably.
The very most important question; does it actually help? What
benchmarks, give what numbers?
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-10-31 8:20 ` Peter Zijlstra
@ 2017-10-31 8:48 ` Mike Galbraith
2017-11-01 6:08 ` Atish Patra
2017-11-05 0:58 ` Joel Fernandes
1 sibling, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-10-31 8:48 UTC (permalink / raw)
To: Peter Zijlstra, Atish Patra
Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote:
> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> > Currently, multiple tasks can wake up on the same cpu from
> > the select_idle_sibling() path in case they wake up simultaneously
> > and last ran on the same llc. This happens because an idle cpu
> > is not updated until the idle task is scheduled out. Any task waking
> > during that period may potentially select that cpu as a wakeup
> > candidate.
> >
> > Introduce a per-cpu variable that is set as soon as a cpu is
> > selected for wakeup for any task. This prevents other tasks
> > from selecting the same cpu again. Note: this does not close the
> > race window but narrows it to the access of the per-cpu variable.
> > If two wakee tasks access the per-cpu variable at the same time,
> > they may still select the same cpu, but the race window is reduced
> > considerably.
>
> The very most important question; does it actually help? What
> benchmarks, give what numbers?
I played with something ~similar (cmpxchg() idle cpu reservation) a
while back in the context of schbench, and it did help that, but for
generic fast mover benchmarks, the added overhead had the expected
effect, it shaved throughput a wee bit (rob Peter, pay Paul, repeat).
I still have the patch lying about in my rubbish heap, but didn't
bother to save any of the test results.
-Mike
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-10-31 8:48 ` Mike Galbraith
@ 2017-11-01 6:08 ` Atish Patra
2017-11-01 6:54 ` Mike Galbraith
0 siblings, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-11-01 6:08 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On 10/31/2017 03:48 AM, Mike Galbraith wrote:
> On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wake up on the same cpu from
>>> the select_idle_sibling() path in case they wake up simultaneously
>>> and last ran on the same llc. This happens because an idle cpu
>>> is not updated until the idle task is scheduled out. Any task waking
>>> during that period may potentially select that cpu as a wakeup
>>> candidate.
>>>
>>> Introduce a per-cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents other tasks
>>> from selecting the same cpu again. Note: this does not close the
>>> race window but narrows it to the access of the per-cpu variable.
>>> If two wakee tasks access the per-cpu variable at the same time,
>>> they may still select the same cpu, but the race window is reduced
>>> considerably.
>> The very most important question; does it actually help? What
>> benchmarks, give what numbers?
Here are the numbers from one of the OLTP configurations on an 8-socket
x86 machine:
kernel txn/minute (normalized) user/sys
baseline 1.0 80/5
pcpu 1.021 84/5
The throughput gains are not very high and close to the run-to-run
variation %. The schedstat data (added for testing in patch 2/2)
indicates that there are many instances of the race condition that got
addressed, but maybe not enough to trigger a significant throughput
change. All other benchmarks I tested (TPCC, hackbench, schbench,
swingbench) did not show any regression.
I will let Joel post numbers from Android benchmarks.
> I played with something ~similar (cmpxchg() idle cpu reservation)
I had an atomic version earlier as well. Peter's suggestion for per cpu
seems to perform slightly better than atomic.
Thus, this patch has the per cpu version.
> a
> while back in the context of schbench, and it did help that,
Do you have the schbench configuration somewhere that I can test? I
tried various configurations but did not
see any improvement or regression.
> but for
> generic fast mover benchmarks, the added overhead had the expected
> effect, it shaved throughput a wee bit (rob Peter, pay Paul, repeat).
Which benchmark? Is it hackbench or something else?
I have not found any regression yet in my testing. I would be happy to
test any other benchmark or a different configuration of hackbench.
Regards,
Atish
> I still have the patch lying about in my rubbish heap, but didn't
> bother to save any of the test results.
>
> -Mike
>
>
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-01 6:08 ` Atish Patra
@ 2017-11-01 6:54 ` Mike Galbraith
2017-11-01 7:18 ` Mike Galbraith
0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-01 6:54 UTC (permalink / raw)
To: Atish Patra, Peter Zijlstra
Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote:
>
> On 10/31/2017 03:48 AM, Mike Galbraith wrote:
>
> > I played with something ~similar (cmpxchg() idle cpu reservation)
> I had an atomic version earlier as well. Peter's suggestion for per cpu
> seems to perform slightly better than atomic.
Yeah, no doubt about that.
> Thus, this patch has the per cpu version.
> > a
> > while back in the context of schbench, and it did help that,
> Do you have the schbench configuration somewhere that I can test? I
> tried various configurations but did not
> see any improvement or regression.
No, as noted, I didn't save anything. I watched and fiddled with
several configurations on various sized boxen, doing this/that on top
of the reservation thing to try to improve the "it's 100% about perfect
spread" schbench, but didn't like the pricetag, so tossed the lot over
my shoulder and walked away.
-Mike
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-01 6:54 ` Mike Galbraith
@ 2017-11-01 7:18 ` Mike Galbraith
2017-11-01 16:36 ` Atish Patra
0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-01 7:18 UTC (permalink / raw)
To: Atish Patra, Peter Zijlstra
Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On Wed, 2017-11-01 at 07:54 +0100, Mike Galbraith wrote:
> On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote:
>
> > Do you have the schbench configuration somewhere that I can test? I
> > tried various configurations but did not
> > see any improvement or regression.
>
> No, as noted, I didn't save anything. I watched and fiddled with
> several configurations on various sized boxen, doing this/that on top
> of the reservation thing to try to improve the "it's 100% about perfect
> spread" schbench, but didn't like the pricetag, so tossed the lot over
> my shoulder and walked away.
BTW, the biggest trouble with schbench at that time was not so much
about wakeup time stacking (though it was visible), it was using load
averages instead of instantaneous state, and sawtoothing therein.
-Mike
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-01 7:18 ` Mike Galbraith
@ 2017-11-01 16:36 ` Atish Patra
2017-11-01 20:20 ` Mike Galbraith
0 siblings, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-11-01 16:36 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On 11/01/2017 02:18 AM, Mike Galbraith wrote:
> On Wed, 2017-11-01 at 07:54 +0100, Mike Galbraith wrote:
>> On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote:
>>
>>> Do you have the schbench configuration somewhere that I can test? I
>>> tried various configurations but did not
>>> see any improvement or regression.
>> No, as noted, I didn't save anything. I watched and fiddled with
>> several configurations on various sized boxen, doing this/that on top
>> of the reservation thing to try to improve the "it's 100% about perfect
>> spread" schbench, but didn't like the pricetag, so tossed the lot over
>> my shoulder and walked away.
:)
> BTW, the biggest trouble with schbench at that time was not so much
> about wakeup time stacking (though it was visible), it was using load
> averages instead of instantaneous state, and sawtoothing therein.
Ahh I see. Any other benchmark suggestion that may benefit from this fix ?
Regards,
Atish
> -Mike
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-01 16:36 ` Atish Patra
@ 2017-11-01 20:20 ` Mike Galbraith
0 siblings, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-01 20:20 UTC (permalink / raw)
To: Atish Patra, Peter Zijlstra
Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo
On Wed, 2017-11-01 at 11:36 -0500, Atish Patra wrote:
>
> Any other benchmark suggestion that may benefit from this fix ?
Nope. I'll be kinda surprised if you find anything where you don't
have to squint to see a dinky signal modulating white noise :)
-Mike
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-10-31 8:20 ` Peter Zijlstra
2017-10-31 8:48 ` Mike Galbraith
@ 2017-11-05 0:58 ` Joel Fernandes
2017-11-22 5:23 ` Atish Patra
1 sibling, 1 reply; 25+ messages in thread
From: Joel Fernandes @ 2017-11-05 0:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Atish Patra, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
Hi Peter,
On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>> Currently, multiple tasks can wake up on the same cpu from
>> the select_idle_sibling() path in case they wake up simultaneously
>> and last ran on the same llc. This happens because an idle cpu
>> is not updated until the idle task is scheduled out. Any task waking
>> during that period may potentially select that cpu as a wakeup
>> candidate.
>>
>> Introduce a per-cpu variable that is set as soon as a cpu is
>> selected for wakeup for any task. This prevents other tasks
>> from selecting the same cpu again. Note: this does not close the
>> race window but narrows it to the access of the per-cpu variable.
>> If two wakee tasks access the per-cpu variable at the same time,
>> they may still select the same cpu, but the race window is reduced
>> considerably.
>
> The very most important question; does it actually help? What
> benchmarks, give what numbers?
I collected some numbers with an Android benchmark called Jankbench.
Most tests didn't show an improvement or degradation with the patch.
However, one of the tests, called "list view", consistently shows an
improvement. Particularly striking is the improvement at the mean and
the 25th percentile.
For list_view test, Jankbench pulls up a list of text and scrolls the
list, this exercises the display pipeline in Android to render and
display the animation as the scroll happens. For Android, lower frame
times are considered quite important, as that means we are less likely
to drop frames and can give the user a good experience vs a perceivably
poor one.
For each frame, Jankbench measures the total time a frame takes and
stores it in a DB (the time from which the app starts drawing, to when
the rendering completes and the frame is submitted for display).
Following is the distribution of frame times in ms.
count 16304 (@60 fps, 4.5 minutes)
Without patch With patch
mean 5.196633 4.429641 (+14.75%)
std 2.030054 2.310025
25% 5.606810 1.991017 (+64.48%)
50% 5.824013 5.716631 (+1.84%)
75% 5.987102 5.932751 (+0.90%)
95% 6.461230 6.301318 (+2.47%)
99% 9.828959 9.697076 (+1.34%)
Note that although Android uses energy aware scheduling patches, I
turned those off to bring the test as close to mainline as possible. I
also backported Vincent's and Brendan's slow path fixes to the 4.4
kernel that the Pixel 2 uses.
Personally I am in favor of this patch considering this test data but
also that in the past, I remember that our teams had to deal with the
same race issue and used cpusets to avoid it (although they probably
tested with "energy aware" CPU selection kept on).
thanks,
- Joel
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-05 0:58 ` Joel Fernandes
@ 2017-11-22 5:23 ` Atish Patra
2017-11-23 10:52 ` Uladzislau Rezki
0 siblings, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-11-22 5:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
Here are the results of schbench (scheduler latency benchmark) and uperf
(networking benchmark).
Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
schbench config: message threads = 2; time = 180s, worker threads = variable
uperf config: ping-pong test on loopback interface with message size = 8k
Overall, both benchmarks seem to be happiest when the number of threads
is close to the number of cpus.
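For reproducibility, the sweep above would correspond roughly to the following invocation. The flag meanings are assumptions from schbench's usage text (-m = message threads, -t = workers per message thread, -r = runtime in seconds); the commands are echoed rather than executed, since schbench may not be installed:

```shell
# Hypothetical sketch of the worker-thread sweep described above.
for workers in 10 18 19 20 25 30; do
	echo schbench -m 2 -t "$workers" -r 180
done
```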
--------------------------------------------------------------------------------------------------------------------------
schbench Maximum Latency(lower is better):
Base(4.14) Base+pcpu
Num
Worker Mean Stdev Mean Stdev Improvement (%)
10 3026.8 4987.12 523 474.35 82.7210255055
18 13854.6 1841.61 12945.6 125.19 6.5609977913
19 16457 2046.51 12985.4 48.46 21.0949747828
20 14995 2368.84 15838 2038.82 -5.621873958
25 29952.2 107.72 29673.6 337.57 0.9301487036
30 30084 19.768 30096.2 7.782 -0.0405531179
-------------------------------------------------------------------------------------------------------------------
The proposed fix seems to improve the maximum latency for lower numbers
of threads. It also seems to reduce the variation (lower stdev).
If the number of threads is equal to or higher than the number of cpus,
the benchmark's nature results in significantly higher latencies.
Results for the higher-thread use cases are presented to provide a
complete picture, but it is difficult to conclude anything from them.
Next, individual percentile results are presented for each use case.
The proposed fix also improves latency across all percentiles for the
configuration (19 worker threads) which should saturate the system.
---------------------------------------------------------------------------------------------------------------------
schbench Latency in usec(lower is better)
Baseline(4.14) Base+pcpu
Num
Worker Mean stdev Mean stdev Improvement(%)
50th
10 64.2 2.039 63.6 1.743 0.934
18 57.6 5.388 57 4.939 1.041
19 63 4.774 58 4 7.936
20 59.6 4.127 60.2 5.153 -1.006
25 78.4 0.489 78.2 0.748 0.255
30 96.2 0.748 96.4 1.019 -0.207
75th
10 72 3.033 71.6 2.939 0.555
18 78 2.097 77.2 2.135 1.025
19 81.6 1.2 79.4 0.8 2.696
20 81 1.264 80.4 2.332 0.740
25 109.6 1.019 110 0 -0.364
30 781.4 50.902 731.8 70.6382 6.3475
90th
10 80.4 3.666 80.6 2.576 -0.248
18 87.8 1.469 88 1.673 -0.227
19 92.8 0.979 90.6 0.489 2.370
20 92.6 1.019 92 2 0.647
25 8977.6 1277.160 9014.4 467.857 -0.409
30 9558.4 334.641 9507.2 320.383 0.5356
95th
10 86.8 3.867 87.6 4.409 -0.921
18 95.4 1.496 95.2 2.039 0.209
19 102.6 1.624 99 0.894 3.508
20 103.2 1.326 102.2 2.481 0.968
25 12400 78.383 12406.4 37.318 -0.051
30 12336 40.477 12310.4 12.8 0.207
99th
10 99.2 5.418 103.4 6.887 -4.233
18 115.2 2.561 114.6 3.611 0.5208
19 126.25 4.573 120.4 3.872 4.6336
20 145.4 3.09 133 1.41 8.5281
25 12988.8 15.676 12924.8 25.6 0.4927
30 12988.8 15.676 12956.8 32.633 0.2463
99.50th
10 104.4 5.161 109.8 7.909 -5.172
18 127.6 7.391 124.2 4.214 2.6645
19 2712.2 4772.883 133.6 5.571 95.074
20 3707.8 2831.954 2844.2 4708.345 23.291
25 14032 1283.834 13008 0 7.2976
30 16550.4 886.382 13840 1218.355 16.376
------------------------------------------------------------------------------------------------------------------------
Results from uperf
uperf config: Loopback ping pong test with message size = 8k
Baseline (4.14) Baseline +pcpu
Mean stdev Mean stdev Improvement(%)
1 9.056 0.02 8.966 0.083 -0.993
2 17.664 0.13 17.448 0.303 -1.222
4 32.03 0.22 31.972 0.129 -0.181
8 58.198 0.31 58.588 0.198 0.670
16 101.018 0.67 100.056 0.455 -0.952
32 148.1 15.41 164.494 2.312 11.069
64 203.66 1.16 203.042 1.348 -0.3073
128 197.12 1.04 194.722 1.174 -1.2165
The race window fix seems to help uperf for 32 threads (closest to the
number of cpus) as well.
Regards,
Atish
On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> Hi Peter,
>
> On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wake up on the same cpu from
>>> the select_idle_sibling() path in case they wake up simultaneously
>>> and last ran on the same llc. This happens because an idle cpu
>>> is not updated until the idle task is scheduled out. Any task waking
>>> during that period may potentially select that cpu as a wakeup
>>> candidate.
>>>
>>> Introduce a per-cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents other tasks
>>> from selecting the same cpu again. Note: this does not close the
>>> race window but narrows it to the access of the per-cpu variable.
>>> If two wakee tasks access the per-cpu variable at the same time,
>>> they may still select the same cpu, but the race window is reduced
>>> considerably.
>> The very most important question; does it actually help? What
>> benchmarks, give what numbers?
> I collected some numbers with an Android benchmark called Jankbench.
> Most tests didn't show an improvement or degradation with the patch.
> However, one of the tests, called "list view", consistently shows an
> improvement. Particularly striking is the improvement at the mean and
> the 25th percentile.
>
> For list_view test, Jankbench pulls up a list of text and scrolls the
> list, this exercises the display pipeline in Android to render and
> display the animation as the scroll happens. For Android, lower frame
> times are considered quite important, as that means we are less likely
> to drop frames and can give the user a good experience vs a perceivably
> poor one.
>
> For each frame, Jankbench measures the total time a frame takes and
> stores it in a DB (the time from which the app starts drawing, to when
> the rendering completes and the frame is submitted for display).
> Following is the distribution of frame times in ms.
>
> count 16304 (@60 fps, 4.5 minutes)
>
> Without patch With patch
> mean 5.196633 4.429641 (+14.75%)
> std 2.030054 2.310025
> 25% 5.606810 1.991017 (+64.48%)
> 50% 5.824013 5.716631 (+1.84%)
> 75% 5.987102 5.932751 (+0.90%)
> 95% 6.461230 6.301318 (+2.47%)
> 99% 9.828959 9.697076 (+1.34%)
>
> Note that although Android uses energy aware scheduling patches, I
> turned those off to bring the test as close to mainline as possible. I
> also backported Vincent's and Brendan's slow path fixes to the 4.4
> kernel that the Pixel 2 uses.
>
> Personally I am in favor of this patch considering this test data but
> also that in the past, I remember that our teams had to deal with the
> same race issue and used cpusets to avoid it (although they probably
> tested with "energy aware" CPU selection kept on).
>
> thanks,
>
> - Joel
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-22 5:23 ` Atish Patra
@ 2017-11-23 10:52 ` Uladzislau Rezki
2017-11-23 13:13 ` Mike Galbraith
0 siblings, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-23 10:52 UTC (permalink / raw)
To: Atish Patra, Peter Zijlstra
Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
Hello, Atish, Peter, all.
I have a question about the case where a task's nr_cpus_allowed is 1.
In that scenario we do not call select_task_rq(). Therefore, even
though a task "p" is placed on an idle CPU, that CPU will not be
marked as claimed for wake-up.
What do you think about setting per_cpu(claim_wakeup, cpu) = 1;
in select_task_rq() instead, and possibly getting rid of it in the
other places (it increases the race window a bit)?
Also, it would help improve the ILB decision for NO_HZ idle
balance. To be more clear, I mean:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8111cfa20075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
!cpu_online(cpu)))
cpu = select_fallback_rq(task_cpu(p), p);
+ per_cpu(claim_wakeup, cpu) = 1;
return cpu;
}
@@ -3888,6 +3889,7 @@ int task_prio(const struct task_struct *p)
return p->prio - MAX_RT_PRIO;
}
+DEFINE_PER_CPU(int, claim_wakeup);
/**
* idle_cpu - is a given CPU idle currently?
* @cpu: the processor in question.
@@ -3909,6 +3911,9 @@ int idle_cpu(int cpu)
return 0;
#endif
+ if (per_cpu(claim_wakeup, cpu))
+ return 0;
+
return 1;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..55ad7fa97e95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,7 +8596,17 @@ static struct {
static inline int find_new_ilb(void)
{
- int ilb = cpumask_first(nohz.idle_cpus_mask);
+ cpumask_t cpumask;
+ int ilb;
+
+ cpumask_copy(&cpumask, nohz.idle_cpus_mask);
+
+ /*
+ * Probably is not good approach in case of many CPUs
+ */
+ for_each_cpu(ilb, &cpumask)
+ if (!per_cpu(claim_wakeup, ilb))
+ break;
if (ilb < nr_cpu_ids && idle_cpu(ilb))
return ilb;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..91cf50cab836 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
put_prev_task(rq, prev);
update_idle_core(rq);
+ this_cpu_write(claim_wakeup, 0);
schedstat_inc(rq->sched_goidle);
return rq->idle;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..0e0bb7536725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
struct sched_group_capacity {
atomic_t ref;
--
Uladzislau Rezki
On Tue, Nov 21, 2017 at 11:23:18PM -0600, Atish Patra wrote:
> Here are the results of schbench (scheduler latency benchmark) and uperf
> (networking benchmark).
>
> Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
> schbench config: message threads = 2; time = 180s, worker threads = variable
> uperf config: ping-pong test on loopback interface with message size = 8k
>
> Overall, both benchmarks seem to be happiest when the number of threads
> is close to the number of cpus.
> --------------------------------------------------------------------------------------------------------------------------
> schbench Maximum Latency(lower is better):
> Base(4.14) Base+pcpu
> Num
> Worker Mean Stdev Mean Stdev Improvement (%)
> 10 3026.8 4987.12 523 474.35 82.7210255055
> 18 13854.6 1841.61 12945.6 125.19 6.5609977913
> 19 16457 2046.51 12985.4 48.46 21.0949747828
> 20 14995 2368.84 15838 2038.82 -5.621873958
> 25 29952.2 107.72 29673.6 337.57 0.9301487036
> 30 30084 19.768 30096.2 7.782 -0.0405531179
>
> -------------------------------------------------------------------------------------------------------------------
>
> The proposed fix seems to improve the maximum latency for lower numbers
> of threads. It also seems to reduce the variation (lower stdev).
>
> If the number of threads is equal to or higher than the number of cpus,
> it results in significantly higher latencies because of the nature of
> the benchmark. Results for the higher-thread use cases are presented to
> provide a complete picture, but it is difficult to conclude anything
> from them.
> 
> Next, individual percentile results are presented for each use case.
> The proposed fix also improves latency across all percentiles for the
> configuration (19 worker threads) which should saturate the system.
> ---------------------------------------------------------------------------------------------------------------------
> schbench Latency in usec(lower is better)
> Baseline(4.14) Base+pcpu
> Num
> Worker Mean stdev Mean stdev Improvement(%)
>
> 50th
> 10 64.2 2.039 63.6 1.743 0.934
> 18 57.6 5.388 57 4.939 1.041
> 19 63 4.774 58 4 7.936
> 20 59.6 4.127 60.2 5.153 -1.006
> 25 78.4 0.489 78.2 0.748 0.255
> 30 96.2 0.748 96.4 1.019 -0.207
>
> 75th
> 10 72 3.033 71.6 2.939 0.555
> 18 78 2.097 77.2 2.135 1.025
> 19 81.6 1.2 79.4 0.8 2.696
> 20 81 1.264 80.4 2.332 0.740
> 25 109.6 1.019 110 0 -0.364
> 30 781.4 50.902 731.8 70.6382 6.3475
>
> 90th
> 10 80.4 3.666 80.6 2.576 -0.248
> 18 87.8 1.469 88 1.673 -0.227
> 19 92.8 0.979 90.6 0.489 2.370
> 20 92.6 1.019 92 2 0.647
> 25 8977.6 1277.160 9014.4 467.857 -0.409
> 30 9558.4 334.641 9507.2 320.383 0.5356
>
> 95th
> 10 86.8 3.867 87.6 4.409 -0.921
> 18 95.4 1.496 95.2 2.039 0.209
> 19 102.6 1.624 99 0.894 3.508
> 20 103.2 1.326 102.2 2.481 0.968
> 25 12400 78.383 12406.4 37.318 -0.051
> 30 12336 40.477 12310.4 12.8 0.207
>
> 99th
> 10 99.2 5.418 103.4 6.887 -4.233
> 18 115.2 2.561 114.6 3.611 0.5208
> 19 126.25 4.573 120.4 3.872 4.6336
> 20 145.4 3.09 133 1.41 8.5281
> 25 12988.8 15.676 12924.8 25.6 0.4927
> 30 12988.8 15.676 12956.8 32.633 0.2463
>
> 99.50th
> 10 104.4 5.161 109.8 7.909 -5.172
> 18 127.6 7.391 124.2 4.214 2.6645
> 19 2712.2 4772.883 133.6 5.571 95.074
> 20 3707.8 2831.954 2844.2 4708.345 23.291
> 25 14032 1283.834 13008 0 7.2976
> 30 16550.4 886.382 13840 1218.355 16.376
> ------------------------------------------------------------------------------------------------------------------------
>
>
> Results from uperf
> uperf config: Loopback ping pong test with message size = 8k
>
> Baseline (4.14) Baseline +pcpu
> Mean stdev Mean stdev Improvement(%)
> 1 9.056 0.02 8.966 0.083 -0.993
> 2 17.664 0.13 17.448 0.303 -1.222
> 4 32.03 0.22 31.972 0.129 -0.181
> 8 58.198 0.31 58.588 0.198 0.670
> 16 101.018 0.67 100.056 0.455 -0.952
> 32 148.1 15.41 164.494 2.312 11.069
> 64 203.66 1.16 203.042 1.348 -0.3073
> 128 197.12 1.04 194.722 1.174 -1.2165
>
> The race window fix seems to help uperf at 32 threads (closest to the
> number of cpus) as well.
>
> Regards,
> Atish
>
> On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> > Hi Peter,
> >
> > On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> > > > Currently, multiple tasks can wake up on the same cpu via the
> > > > select_idle_sibling() path if they wake up simultaneously
> > > > and last ran on the same llc. This happens because an idle cpu
> > > > is not updated until the idle task is scheduled out. Any task waking
> > > > during that period may select that cpu as a wakeup
> > > > candidate.
> > > >
> > > > Introduce a per-cpu variable that is set as soon as a cpu is
> > > > selected for wakeup for any task. This prevents other tasks
> > > > from selecting the same cpu again. Note: this does not close the race
> > > > window but narrows it to the access of the per-cpu variable. If two
> > > > wakee tasks access the per-cpu variable at the same time, they may
> > > > select the same cpu again. But it narrows the race window
> > > > considerably.
> > > The very most important question; does it actually help? What
> > > benchmarks, give what numbers?
> > I collected some numbers with an Android benchmark called Jankbench.
> > Most tests didn't show an improvement or degradation with the patch.
> > However, one of the tests called "list view", consistently shows an
> > improvement. Particularly striking is the improvement at mean and 25
> > percentile.
> >
> > For list_view test, Jankbench pulls up a list of text and scrolls the
> > list, this exercises the display pipeline in Android to render and
> > display the animation as the scroll happens. For Android, lower frame
> > times is considered quite important as that means we are less likely
> > to drop frames and give the user a good experience vs a perceivable
> > poor experience.
> >
> > For each frame, Jankbench measures the total time a frame takes and
> > stores it in a DB (the time from which the app starts drawing, to when
> > the rendering completes and the frame is submitted for display).
> > Following is the distribution of frame times in ms.
> >
> > count 16304 (@60 fps, 4.5 minutes)
> >
> > Without patch With patch
> > mean 5.196633 4.429641 (+14.75%)
> > std 2.030054 2.310025
> > 25% 5.606810 1.991017 (+64.48%)
> > 50% 5.824013 5.716631 (+1.84%)
> > 75% 5.987102 5.932751 (+0.90%)
> > 95% 6.461230 6.301318 (+2.47%)
> > 99% 9.828959 9.697076 (+1.34%)
> >
> > Note that although Android uses energy aware scheduling patches, I
> > turned those off to bring the test as close to mainline as possible. I
> > also backported Vincent's and Brendan's slow path fixes to the 4.4
> > kernel that the Pixel 2 uses.
> >
> > Personally I am in favor of this patch considering this test data but
> > also that in the past, I remember that our teams had to deal with the
> > same race issue and used cpusets to avoid it (although they probably
> > tested with "energy aware" CPU selection kept on).
> >
> > thanks,
> >
> > - Joel
>
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-23 10:52 ` Uladzislau Rezki
@ 2017-11-23 13:13 ` Mike Galbraith
2017-11-23 16:00 ` Josef Bacik
2017-11-24 10:26 ` Uladzislau Rezki
0 siblings, 2 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-23 13:13 UTC (permalink / raw)
To: Uladzislau Rezki, Atish Patra, Peter Zijlstra
Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
> Hello, Atish, Peter, all.
>
> I have a question about the case where a task's nr_cpus_allowed is 1.
> In that scenario we do not call select_task_rq(). Therefore,
> even though a task "p" is placed on an idle CPU, that CPU
> will not be marked as claimed for wake-up.
> 
> What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
> to select_task_rq() instead, possibly getting rid of it in the
> other places (this widens the race window a bit)?
My thoughts on all of this are that we need less SIS, not more. Rather
than trying so hard for the absolute lowest wakeup latency, which
induces throughput/efficiency robbing bouncing, I think we'd be better
off considering leaving an already llc-affine task where it is if the
average cycle time is sufficiently low that it will likely hit the CPU
RSN. Completely ignoring low utilization kernel threads would go a
long way to getting rid of bouncing userspace (which tends to have a
meaningful footprint), all over hell and creation.
You could also periodically send mobile kthreads down the slow path to
try to keep them the hell away from partially busy CPUs, as well as
anything else that hasn't run for a while, to keep background cruft
from continually injecting itself into the middle of a cross core
cyber-sex.
-Mike
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-23 13:13 ` Mike Galbraith
@ 2017-11-23 16:00 ` Josef Bacik
2017-11-23 17:40 ` Mike Galbraith
2017-11-23 21:11 ` Atish Patra
2017-11-24 10:26 ` Uladzislau Rezki
1 sibling, 2 replies; 25+ messages in thread
From: Josef Bacik @ 2017-11-23 16:00 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote:
> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
> > Hello, Atish, Peter, all.
> >
> > I have a question about if a task's nr_cpus_allowed is 1.
> > In that scenario we do not call select_task_rq. Therefore
> > even thought a task "p" is placed on idle CPU that CPU
> > will not be marked as claimed for wake-up.
> >
> > What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
> > to select_task_rq() instead and possibly get rid of them from
> > other places (increases a race window a bit)?
>
> My thoughts on all of this is that we need less SIS, not more. Rather
> than trying so hard for the absolute lowest wakeup latency, which
> induces throughput/efficiency robbing bouncing, I think we'd be better
> of considering leaving an already llc affine task where it is if the
> average cycle time is sufficiently low that it will likely hit the CPU
> RSN. Completely ignoring low utilization kernel threads would go a
> long way to getting rid of bouncing userspace (which tends to have a
> meaningful footprint), all over hell and creation.
>
> You could also periodically send mobile kthreads down the slow path to
> try to keep them the hell away from partially busy CPUs, as well as
> anything else that hasn't run for a while, to keep background cruft
> from continually injecting itself into the middle of a cross core
> cyber-sex.
>
And on this thanksgiving I'm thankful for Mike, and his entertaining early
morning emails.
Josef
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-23 16:00 ` Josef Bacik
@ 2017-11-23 17:40 ` Mike Galbraith
2017-11-23 21:11 ` Atish Patra
1 sibling, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-23 17:40 UTC (permalink / raw)
To: Josef Bacik
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Thu, 2017-11-23 at 11:00 -0500, Josef Bacik wrote:
>
> And on this thanksgiving I'm thankful for Mike, and his entertaining early
> morning emails.
Read it again tomorrow.
-Mike
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-23 16:00 ` Josef Bacik
2017-11-23 17:40 ` Mike Galbraith
@ 2017-11-23 21:11 ` Atish Patra
1 sibling, 0 replies; 25+ messages in thread
From: Atish Patra @ 2017-11-23 21:11 UTC (permalink / raw)
To: Josef Bacik, Mike Galbraith
Cc: Uladzislau Rezki, Peter Zijlstra, Joel Fernandes, LKML,
Brendan Jackman, Josef Bacik, Ingo Molnar
On 2017/11/23 10:00 AM, Josef Bacik wrote:
> On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote:
>> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
>>> Hello, Atish, Peter, all.
>>>
>>> I have a question about if a task's nr_cpus_allowed is 1.
>>> In that scenario we do not call select_task_rq. Therefore
>>> even thought a task "p" is placed on idle CPU that CPU
>>> will not be marked as claimed for wake-up.
>>>
>>> What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
>>> to select_task_rq() instead and possibly get rid of them from
>>> other places (increases a race window a bit)?
>> My thoughts on all of this is that we need less SIS, not more. Rather
>> than trying so hard for the absolute lowest wakeup latency, which
>> induces throughput/efficiency robbing bouncing, I think we'd be better
>> of considering leaving an already llc affine task where it is if the
>> average cycle time is sufficiently low that it will likely hit the CPU
>> RSN. Completely ignoring low utilization kernel threads would go a
>> long way to getting rid of bouncing userspace (which tends to have a
>> meaningful footprint), all over hell and creation.
>>
>> You could also periodically send mobile kthreads down the slow path to
>> try to keep them the hell away from partially busy CPUs, as well as
>> anything else that hasn't run for a while, to keep background cruft
>> from continually injecting itself into the middle of a cross core
>> cyber-sex.
>>
> And on this thanksgiving I'm thankful for Mike, and his entertaining early
> morning emails.
:) :).
> Josef
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-23 13:13 ` Mike Galbraith
2017-11-23 16:00 ` Josef Bacik
@ 2017-11-24 10:26 ` Uladzislau Rezki
2017-11-24 18:46 ` Mike Galbraith
1 sibling, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-24 10:26 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote:
> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
> > Hello, Atish, Peter, all.
> >
> > I have a question about if a task's nr_cpus_allowed is 1.
> > In that scenario we do not call select_task_rq. Therefore
> > even thought a task "p" is placed on idle CPU that CPU
> > will not be marked as claimed for wake-up.
> >
> > What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
> > to select_task_rq() instead and possibly get rid of them from
> > other places (increases a race window a bit)?
>
> My thoughts on all of this is that we need less SIS, not more. Rather
> than trying so hard for the absolute lowest wakeup latency, which
> induces throughput/efficiency robbing bouncing, I think we'd be better
> of considering leaving an already llc affine task where it is if the
> average cycle time is sufficiently low that it will likely hit the CPU
> RSN.
I guess there is a misunderstanding here. The main goal is not to cover
the pinned case, for sure. I was thinking more about the points below:
- Extend the claim_wake_up logic to make the ILB/NO_HZ decision more
predictable (which is good for mobile workloads). As it stands, it
simply returns the first CPU in the "nohz" mask; if we know that CPU
has been claimed, I think it is worth going with another ILB core,
since waking up on a remote CPU and then doing nohz_idle_balance
does not improve wake-up latency and is a miss from the ilb point of view.
- Get rid of duplication;
- Not be limited to the pinned case.
If you have any proposal, I would appreciate it if you could
share your specific view.
> Completely ignoring low utilization kernel threads would go a
> long way to getting rid of bouncing userspace (which tends to have a
> meaningful footprint), all over hell and creation.
>
> You could also periodically send mobile kthreads down the slow path to
> try to keep them the hell away from partially busy CPUs, as well as
> anything else that hasn't run for a while, to keep background cruft
> from continually injecting itself into the middle of a cross core
> cyber-sex.
A CPU is not considered idle (in terms of idle_cpu()) until it passes
the first check, which tests whether current is the idle thread. If we
reach the last check while a claim wake-up is set, it means that the
CPU is about to switch from idle to non-idle, and that will happen
quite soon, depending on wake-up latency.
Considering a core as not-idle when somebody intends to wake up a task
on it is a good point. If you have any specific example where it is bad,
please share it.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-24 10:26 ` Uladzislau Rezki
@ 2017-11-24 18:46 ` Mike Galbraith
2017-11-26 20:58 ` Mike Galbraith
2017-11-28 9:34 ` Uladzislau Rezki
0 siblings, 2 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-24 18:46 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
Brendan Jackman, Josef Bacik, Ingo Molnar
On Fri, 2017-11-24 at 11:26 +0100, Uladzislau Rezki wrote:
>
> I guess there is misunderstanding here. The main goal is not to cover
> pinned case, for sure. I was thinking more about below points:
>
> - Extend a claim_wake_up logic for making an ILB/NO_HZ decision more
> predictable (that is good for mobile workloads). Because as it is
> right now it simply returns a first CPU in a "nohz" mask and if we
> know that CPU has been claimed i think it is worth to go with another
> ILB core, since waking up on a remote CPU + doing nohz_idle_balance
> does not improve wake-up latency and is a miss from ilb point of view.
Using unsynchronized access of a flag set by chaotic events all over
the box?
> If you have any proposal, i would be appreciated if you could
> share your specific view.
My view is you're barking up the wrong tree: you're making the idle
data SIS is using more accurate, but I question the benefit. That it
makes an imperfect placement decision occasionally due to raciness is
nearly meaningless compared to the cost of frequent bounce. Better
(but still not perfect, that requires atomics) state information
doesn't meaningfully improve decision quality.
> Considering a core as not-idle when somebody tends to wake up a task on
> it is a good point. If you have any specific example when it is bad,
> please share it.
I'm not sure how that will work out. Probably like most everything wrt
SIS, first comes "woohoo" then "aw crap", and off to the trash it goes.
-Mike
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-24 18:46 ` Mike Galbraith
@ 2017-11-26 20:58 ` Mike Galbraith
2017-11-28 9:34 ` Uladzislau Rezki
1 sibling, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-26 20:58 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
Brendan Jackman, Josef Bacik, Ingo Molnar
On Fri, 2017-11-24 at 19:46 +0100, Mike Galbraith wrote:
>
> My view is you're barking up the wrong tree: you're making the idle
> data SIS is using more accurate, but I question the benefit. That it
> makes an imperfect placement decision occasionally due to raciness is
> nearly meaningless compared to the cost of frequent bounce.
Playing with SIS (yet again), below is a hack that illustrates why I
think the occasional races are nearly meaningless.
Box = i4790 desktop.
master masterx
TCP_STREAM-1 Avg: 70495 71295
TCP_STREAM-2 Avg: 54388 66202
TCP_STREAM-4 Avg: 19316 21413
TCP_STREAM-8 Avg: 9678 8894
TCP_STREAM-16 Avg: 4286 4360
TCP_MAERTS-1 Avg: 69238 71799
TCP_MAERTS-2 Avg: 50729 65612
TCP_MAERTS-4 Avg: 19095 21984
TCP_MAERTS-8 Avg: 9405 8888
TCP_MAERTS-16 Avg: 4891 4371
TCP_RR-1 Avg: 198617 203291
TCP_RR-2 Avg: 152862 191761
TCP_RR-4 Avg: 112241 117888
TCP_RR-8 Avg: 104453 113260
TCP_RR-16 Avg: 50897 55280
UDP_RR-1 Avg: 250738 264214
UDP_RR-2 Avg: 196250 253352
UDP_RR-4 Avg: 152862 158819
UDP_RR-8 Avg: 143781 154071
UDP_RR-16 Avg: 68605 76492
tbench 1 2 4 8 16
master 772 1207 1829 3516 3440
masterx 811 1466 1959 3737 3670
hackbench -l 10000
5.917 5.990 5.957 avg 5.954 NO_SIS_DEBOUNCE
5.886 5.808 5.826 avg 5.840 SIS_DEBOUNCE
echo 0 > tracing_on
echo 1 > events/sched/sched_migrate_task/enable
start endless tbench 2
for i in `seq 3`
do
echo > trace
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
cat trace|grep tbench|wc -l
done
kde desktop idling
NO_SIS_DEBOUNCE 261 208 199
SIS_DEBOUNCE 8 6 0
add firefox playing youtube documentary
NO_SIS_DEBOUNCE 10906 10094 10774
SIS_DEBOUNCE 34 34 34
tbench 2 throughput as firefox runs
NO_SIS_DEBOUNCE 1129.63 MB/sec
SIS_DEBOUNCE 1462.53 MB/sec
Advisory: welding goggles.
---
include/linux/sched.h | 3 ++-
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 1 +
3 files changed, 42 insertions(+), 3 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -541,7 +541,6 @@ struct task_struct {
unsigned int ptrace;
#ifdef CONFIG_SMP
- struct llist_node wake_entry;
int on_cpu;
#ifdef CONFIG_THREAD_INFO_IN_TASK
/* Current CPU: */
@@ -549,8 +548,10 @@ struct task_struct {
#endif
unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ unsigned long wakee_placed;
struct task_struct *last_wakee;
+ struct llist_node wake_entry;
int wake_cpu;
#endif
int on_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6174,6 +6174,9 @@ static int select_idle_sibling(struct ta
if ((unsigned)i < nr_cpumask_bits)
return i;
+ if (sched_feat(SIS_DEBOUNCE))
+ p->wakee_placed = jiffies;
+
return target;
}
@@ -6258,6 +6261,22 @@ static int wake_cap(struct task_struct *
return min_cap * 1024 < task_util(p) * capacity_margin;
}
+static bool task_placed(struct task_struct *p)
+{
+ return p->wakee_placed == jiffies;
+}
+
+static bool task_llc_affine_and_cold(struct task_struct *p, int cpu, int prev)
+{
+ int cold = sysctl_sched_migration_cost;
+
+ if (!cpus_share_cache(cpu, prev))
+ return false;
+ if (cold > 0 && rq_clock_task(cpu_rq(prev)) - p->se.exec_start > cold)
+ return true;
+ return false;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6276,16 +6295,26 @@ select_task_rq_fair(struct task_struct *
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
- int want_affine = 0;
+ int want_affine = 0, want_debounce = 0;
int sync = wake_flags & WF_SYNC;
+ rcu_read_lock();
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
+ want_debounce = sched_feat(SIS_DEBOUNCE);
+ if (task_placed(p))
+ goto out_unlock;
+ /* Balance cold tasks to reduce hot task bounce tendency. */
+ if (want_debounce && task_llc_affine_and_cold(p, cpu, prev_cpu)) {
+ sd_flag |= SD_SHARE_PKG_RESOURCES;
+ sd = highest_flag_domain(prev_cpu, SD_SHARE_PKG_RESOURCES);
+ p->wakee_placed = jiffies;
+ goto pick_cpu_cold;
+ }
want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
&& cpumask_test_cpu(cpu, &p->cpus_allowed);
}
- rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6315,6 +6344,7 @@ select_task_rq_fair(struct task_struct *
new_cpu = cpu;
}
+pick_cpu_cold:
if (sd && !(sd_flag & SD_BALANCE_FORK)) {
/*
* We're going to need the task's util for capacity_spare_wake
@@ -6329,9 +6359,13 @@ select_task_rq_fair(struct task_struct *
if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ if (want_debounce && new_cpu == prev_cpu)
+ p->wakee_placed = jiffies;
+
} else {
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
}
+out_unlock:
rcu_read_unlock();
return new_cpu;
@@ -6952,6 +6986,9 @@ static int task_hot(struct task_struct *
if (sysctl_sched_migration_cost == 0)
return 0;
+ if (task_placed(p))
+ return 1;
+
delta = rq_clock_task(env->src_rq) - p->se.exec_start;
return delta < (s64)sysctl_sched_migration_cost;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_DEBOUNCE, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-24 18:46 ` Mike Galbraith
2017-11-26 20:58 ` Mike Galbraith
@ 2017-11-28 9:34 ` Uladzislau Rezki
2017-11-28 10:49 ` Mike Galbraith
1 sibling, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-28 9:34 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> On Fri, 2017-11-24 at 11:26 +0100, Uladzislau Rezki wrote:
> >
> > I guess there is misunderstanding here. The main goal is not to cover
> > pinned case, for sure. I was thinking more about below points:
> >
> > - Extend a claim_wake_up logic for making an ILB/NO_HZ decision more
> > predictable (that is good for mobile workloads). Because as it is
> > right now it simply returns a first CPU in a "nohz" mask and if we
> > know that CPU has been claimed i think it is worth to go with another
> > ILB core, since waking up on a remote CPU + doing nohz_idle_balance
> > does not improve wake-up latency and is a miss from ilb point of view.
>
> Using unsynchronized access of a flag set by chaotic events all over
> the box?
Well, entering NO_HZ mode and checking nohz.idle_cpus_mask in
nohz_balancer_kick() -> find_new_ilb() are obviously asynchronous
events and, as you say, chaotic, except for the access to the
cpumask variable (I mean atomically). So I do not see any issues
with that.
>
> > If you have any proposal, i would be appreciated if you could
> > share your specific view.
>
> My view is you're barking up the wrong tree: you're making the idle
> data SIS is using more accurate, but I question the benefit. That it
> makes an imperfect placement decision occasionally due to raciness is
> nearly meaningless compared to the cost of frequent bounce.
Before sitting down and starting to test, I just illustrated how we can
apply claim_wake_up to the ilb, asking the community for its view on it:
drawbacks, pros/cons, proposals, etc.
>
> Better (but still not perfect, that requires atomics) state information
> doesn't meaningfully improve decision quality.
>
I agree, it should be atomic, something like below:
core.c
+/* cpus with claim wake-up set bit */
+cpumask_t cpu_claim_wake_up_map;
...
@@ -1559,6 +1562,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
!cpu_online(cpu)))
cpu = select_fallback_rq(task_cpu(p), p);
+ cpumask_test_and_set_cpu(cpu, &cpu_claim_wake_up_map);
return cpu;
}
...
+ if (cpumask_test_cpu(cpu, &cpu_claim_wake_up_map))
+ return 0;
+
return 1;
}
fair.c:
static inline int find_new_ilb(void)
{
- int ilb = cpumask_first(nohz.idle_cpus_mask);
+ cpumask_t cpumask;
+ int ilb;
+
+ cpumask_andnot(&cpumask, nohz.idle_cpus_mask,
+ &cpu_claim_wake_up_map);
+
+ ilb = cpumask_first(&cpumask);
> > Considering a core as not-idle when somebody tends to wake up a task on
> > it is a good point. If you have any specific example when it is bad,
> > please share it.
>
> I'm not sure how that will work out. Probably like most everything wrt
> SIS, first comes "woohoo" then "aw crap", and off to the trash it goes.
>
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-28 9:34 ` Uladzislau Rezki
@ 2017-11-28 10:49 ` Mike Galbraith
2017-11-29 10:41 ` Uladzislau Rezki
0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-28 10:49 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
Brendan Jackman, Josef Bacik, Ingo Molnar
On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
>
> > My view is you're barking up the wrong tree: you're making the idle
> > data SIS is using more accurate, but I question the benefit. That it
> > makes an imperfect placement decision occasionally due to raciness is
> > nearly meaningless compared to the cost of frequent bounce.
> Before sitting down and start testing, i just illustrated how we can
> apply claim_wake_up to ilb asking community a specific view on it:
> drawbacks, pros/cons, proposals etc.
Even if you make the thing atomic, what is ILB supposed to do, look
over its shoulder every step of the way and sh*t its pants if somebody
touches claim_wake_up as it's about to or just after it did something?
If you intend to make all of LB race free, good luck.
-Mike
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-28 10:49 ` Mike Galbraith
@ 2017-11-29 10:41 ` Uladzislau Rezki
2017-11-29 18:15 ` Mike Galbraith
0 siblings, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-29 10:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> >
> > > My view is you're barking up the wrong tree: you're making the idle
> > > data SIS is using more accurate, but I question the benefit. That it
> > > makes an imperfect placement decision occasionally due to raciness is
> > > nearly meaningless compared to the cost of frequent bounce.
>
> > Before sitting down and start testing, i just illustrated how we can
> > apply claim_wake_up to ilb asking community a specific view on it:
> > drawbacks, pros/cons, proposals etc.
>
> Even if you make the thing atomic, what is ILB supposed to do, look
> over its shoulder every step of the way and sh*t it's pants if somebody
> touches claim_wake_up as it's about to or just after it did something?
If nohz.idle_cpus_mask is set for a particular CPU together with the
claim mask, it means that TIF_NEED_RESCHED is coming or is already in
place. When a CPU hits the idle thread, its claim bit gets reset and it
proceeds to no_hz mode, unless it runs into scheduler_ipi or so.
>
> If you intend to make all of LB race free, good luck.
>
> -Mike
Vlad
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-29 10:41 ` Uladzislau Rezki
@ 2017-11-29 18:15 ` Mike Galbraith
2017-11-30 12:30 ` Uladzislau Rezki
0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-29 18:15 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
Brendan Jackman, Josef Bacik, Ingo Molnar
On Wed, 2017-11-29 at 11:41 +0100, Uladzislau Rezki wrote:
> On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > >
> > > > My view is you're barking up the wrong tree: you're making the idle
> > > > data SIS is using more accurate, but I question the benefit. That it
> > > > makes an imperfect placement decision occasionally due to raciness is
> > > > nearly meaningless compared to the cost of frequent bounce.
> >
> > > Before sitting down and start testing, i just illustrated how we can
> > > apply claim_wake_up to ilb asking community a specific view on it:
> > > drawbacks, pros/cons, proposals etc.
> >
> > Even if you make the thing atomic, what is ILB supposed to do, look
> > over its shoulder every step of the way and sh*t it's pants if somebody
> > touches claim_wake_up as it's about to or just after it did something?
> If nohz.idle_cpus_mask is set for particular CPU together with claim mask,
> it means that TIF_NEED_RESCHED is coming or is already in place. When a
> CPU hits idle_thread a claim bit gets reset and proceed to no_hz mode
> unless it runs into scheduler_ipi or so.
Which means nothing to an LB operation in progress.
But whatever, I'm not going to argue endlessly about something I think
should be blatantly obvious. IMO, this is a couple points shy of
pointless. That's my 'C' to this RFC in a nutshell. I'm done.
-Mike
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
2017-11-29 18:15 ` Mike Galbraith
@ 2017-11-30 12:30 ` Uladzislau Rezki
0 siblings, 0 replies; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-30 12:30 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
LKML, Brendan Jackman, Josef Bacik, Ingo Molnar
On Wed, Nov 29, 2017 at 07:15:21PM +0100, Mike Galbraith wrote:
> On Wed, 2017-11-29 at 11:41 +0100, Uladzislau Rezki wrote:
> > On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> > > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > > >
> > > > > My view is you're barking up the wrong tree: you're making the idle
> > > > > data SIS is using more accurate, but I question the benefit. That it
> > > > > makes an imperfect placement decision occasionally due to raciness is
> > > > > nearly meaningless compared to the cost of frequent bounce.
> > >
> > > > Before sitting down and starting to test, I just illustrated how we could
> > > > apply claim_wake_up to the ILB, asking the community for a specific view
> > > > on it: drawbacks, pros/cons, proposals, etc.
> > >
> > > Even if you make the thing atomic, what is ILB supposed to do, look
> > > over its shoulder every step of the way and sh*t its pants if somebody
> > > touches claim_wake_up as it's about to or just after it did something?
> > If nohz.idle_cpus_mask is set for a particular CPU together with the claim
> > mask, it means that TIF_NEED_RESCHED is coming or is already in place. When a
> > CPU hits the idle thread, its claim bit gets reset and it proceeds to nohz
> > mode, unless it runs into scheduler_ipi or so.
>
> Which means nothing to an LB operation in progress.
>
<snip>
static void nohz_idle_balance(...)
...
	/*
	 * If this cpu gets work to do, stop the load balancing
	 * work being done for other cpus. Next load
	 * balancing owner will pick it up.
	 */
	if (need_resched())
		break;
...
<snip>
>
> But whatever, I'm not going to argue endlessly about something I think
> should be blatantly obvious. IMO, this is a couple points shy of
> pointless. That's my 'C' to this RFC in a nutshell. I'm done.
>
Thank you.
Vlad
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2017-11-30 12:30 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31 5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
2017-10-31 5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
2017-10-31 8:20 ` Peter Zijlstra
2017-10-31 8:48 ` Mike Galbraith
2017-11-01 6:08 ` Atish Patra
2017-11-01 6:54 ` Mike Galbraith
2017-11-01 7:18 ` Mike Galbraith
2017-11-01 16:36 ` Atish Patra
2017-11-01 20:20 ` Mike Galbraith
2017-11-05 0:58 ` Joel Fernandes
2017-11-22 5:23 ` Atish Patra
2017-11-23 10:52 ` Uladzislau Rezki
2017-11-23 13:13 ` Mike Galbraith
2017-11-23 16:00 ` Josef Bacik
2017-11-23 17:40 ` Mike Galbraith
2017-11-23 21:11 ` Atish Patra
2017-11-24 10:26 ` Uladzislau Rezki
2017-11-24 18:46 ` Mike Galbraith
2017-11-26 20:58 ` Mike Galbraith
2017-11-28 9:34 ` Uladzislau Rezki
2017-11-28 10:49 ` Mike Galbraith
2017-11-29 10:41 ` Uladzislau Rezki
2017-11-29 18:15 ` Mike Galbraith
2017-11-30 12:30 ` Uladzislau Rezki
2017-10-31 5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra