* [PATCH RFC 0/2] Fix race window during idle cpu selection.
@ 2017-10-31  5:27 Atish Patra
  2017-10-31  5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
  2017-10-31  5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra
  0 siblings, 2 replies; 25+ messages in thread
From: Atish Patra @ 2017-10-31  5:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo

The first patch fixes the issue. The second patch adds a new schedstat
field intended only for testing.

Atish Patra (2):
  sched: Minimize the idle cpu selection race window.
  sched: Add a stat for idle cpu selection race window.

 kernel/sched/core.c      |  6 ++++++
 kernel/sched/fair.c      | 12 +++++++++---
 kernel/sched/idle_task.c |  1 +
 kernel/sched/sched.h     |  2 ++
 kernel/sched/stats.c     |  5 +++--
 5 files changed, 21 insertions(+), 5 deletions(-)

-- 
2.7.4


* [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-10-31  5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
@ 2017-10-31  5:27 ` Atish Patra
  2017-10-31  8:20   ` Peter Zijlstra
  2017-10-31  5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra
  1 sibling, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-10-31  5:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo

Currently, multiple tasks can wake up on the same cpu via the
select_idle_sibling() path if they wake up simultaneously and last
ran on the same llc. This happens because an idle cpu is not marked
as busy until the idle task is scheduled out. Any task waking up
during that window may select that cpu as a wakeup candidate.

Introduce a per-cpu variable that is set as soon as a cpu is selected
for the wakeup of any task. This prevents other tasks from selecting
the same cpu again. Note: this does not close the race window, but
narrows it to the access of the per-cpu variable. If two wakee tasks
access the per-cpu variable at the same time, they may still select
the same cpu, but the window is considerably smaller.
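
To make the remaining window concrete, the sequence that can still go
wrong is the following (both wakers pass the idle_cpu() check before
either store becomes visible):

    waker A                              waker B
    -------                              -------
    idle_cpu(target) returns 1
                                         idle_cpu(target) returns 1
    per_cpu(claim_wakeup, target) = 1
                                         per_cpu(claim_wakeup, target) = 1
    return target
                                         return target  <-- both stack on target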

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
 kernel/sched/core.c      |  4 ++++
 kernel/sched/fair.c      | 12 +++++++++---
 kernel/sched/idle_task.c |  1 +
 kernel/sched/sched.h     |  1 +
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2288a14..d9d501c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3896,6 +3896,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3917,6 +3918,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13393bb..885023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,8 +6077,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 				idle = false;
 		}
 
-		if (idle)
+		if (idle) {
+			per_cpu(claim_wakeup, core) = 1;
 			return core;
+		}
 	}
 
 	/*
@@ -6102,8 +6104,10 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
+		if (idle_cpu(cpu)) {
+			per_cpu(claim_wakeup, cpu) = 1;
 			return cpu;
+		}
 	}
 
 	return -1;
@@ -6165,8 +6169,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
+		if (idle_cpu(cpu)) {
+			per_cpu(claim_wakeup, cpu) = 1;
 			break;
+		}
 	}
 
 	time = local_clock() - time;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..64d6495 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8aa24b4..5f70b98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,6 +1068,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;
-- 
2.7.4


* [PATCH DEBUG 2/2] sched: Add a stat for idle cpu selection race window.
  2017-10-31  5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
  2017-10-31  5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
@ 2017-10-31  5:27 ` Atish Patra
  1 sibling, 0 replies; 25+ messages in thread
From: Atish Patra @ 2017-10-31  5:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo

This is a **debug-only** patch and is not intended for merging.

A new schedstat field is added that counts the number of times a cpu
had already been claimed for a wakeup when another wakeup tried to
select it again. The field is appended as the last column of the
per-cpu line in /proc/schedstat. It helps verify whether the issue in
question is present in a specific benchmark.
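
For illustration only (not part of the patch), a minimal userspace
sketch that pulls the new trailing column out of each per-cpu line of
/proc/schedstat could look like this:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		unsigned int claimed;
		int cpu;
		char *last;

		/* only the per-cpu lines carry the new counter */
		if (sscanf(line, "cpu%d", &cpu) != 1)
			continue;
		/* ttwu_claimed is appended as the last field */
		last = strrchr(line, ' ');
		if (last && sscanf(last, "%u", &claimed) == 1)
			printf("cpu%d ttwu_claimed=%u\n", cpu, claimed);
	}
	fclose(f);
	return 0;
}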

Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
 kernel/sched/core.c  | 4 +++-
 kernel/sched/sched.h | 1 +
 kernel/sched/stats.c | 5 +++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9d501c..d12b8c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3918,8 +3918,10 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
-	if (per_cpu(claim_wakeup, cpu))
+	if (per_cpu(claim_wakeup, cpu)) {
+		schedstat_inc(rq->ttwu_claimed);
 		return 0;
+	}
 
 	return 1;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f70b98..8f3047c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -801,6 +801,7 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
+	unsigned int ttwu_claimed;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 87e2c9f..2290aa7 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -30,12 +30,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay,
+		    rq->rq_sched_info.pcount, rq->ttwu_claimed);
 
 		seq_printf(seq, "\n");
 
-- 
2.7.4


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-10-31  5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
@ 2017-10-31  8:20   ` Peter Zijlstra
  2017-10-31  8:48     ` Mike Galbraith
  2017-11-05  0:58     ` Joel Fernandes
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-10-31  8:20 UTC (permalink / raw)
  To: Atish Patra; +Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo

On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> Currently, multiple tasks can wakeup on same cpu from
> select_idle_sibiling() path in case they wakeup simulatenously
> and last ran on the same llc. This happens because an idle cpu
> is not updated until idle task is scheduled out. Any task waking
> during that period may potentially select that cpu for a wakeup
> candidate.
> 
> Introduce a per cpu variable that is set as soon as a cpu is
> selected for wakeup for any task. This prevents from other tasks
> to select the same cpu again. Note: This does not close the race
> window but minimizes it to accessing the per-cpu variable. If two
> wakee tasks access the per cpu variable at the same time, they may
> select the same cpu again. But it minimizes the race window
> considerably.

The very most important question; does it actually help? What
benchmarks, give what numbers?


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-10-31  8:20   ` Peter Zijlstra
@ 2017-10-31  8:48     ` Mike Galbraith
  2017-11-01  6:08       ` Atish Patra
  2017-11-05  0:58     ` Joel Fernandes
  1 sibling, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-10-31  8:48 UTC (permalink / raw)
  To: Peter Zijlstra, Atish Patra
  Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo

On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote:
> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> > Currently, multiple tasks can wakeup on same cpu from
> > select_idle_sibiling() path in case they wakeup simulatenously
> > and last ran on the same llc. This happens because an idle cpu
> > is not updated until idle task is scheduled out. Any task waking
> > during that period may potentially select that cpu for a wakeup
> > candidate.
> > 
> > Introduce a per cpu variable that is set as soon as a cpu is
> > selected for wakeup for any task. This prevents from other tasks
> > to select the same cpu again. Note: This does not close the race
> > window but minimizes it to accessing the per-cpu variable. If two
> > wakee tasks access the per cpu variable at the same time, they may
> > select the same cpu again. But it minimizes the race window
> > considerably.
> 
> The very most important question; does it actually help? What
> benchmarks, give what numbers?

I played with something ~similar (cmpxchg() idle cpu reservation) a
while back in the context of schbench, and it did help that, but for
generic fast mover benchmarks, the added overhead had the expected
effect, it shaved throughput a wee bit (rob Peter, pay Paul, repeat).

I still have the patch lying about in my rubbish heap, but didn't
bother to save any of the test results.

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-10-31  8:48     ` Mike Galbraith
@ 2017-11-01  6:08       ` Atish Patra
  2017-11-01  6:54         ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-11-01  6:08 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra
  Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo



On 10/31/2017 03:48 AM, Mike Galbraith wrote:
> On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wakeup on same cpu from
>>> select_idle_sibiling() path in case they wakeup simulatenously
>>> and last ran on the same llc. This happens because an idle cpu
>>> is not updated until idle task is scheduled out. Any task waking
>>> during that period may potentially select that cpu for a wakeup
>>> candidate.
>>>
>>> Introduce a per cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents from other tasks
>>> to select the same cpu again. Note: This does not close the race
>>> window but minimizes it to accessing the per-cpu variable. If two
>>> wakee tasks access the per cpu variable at the same time, they may
>>> select the same cpu again. But it minimizes the race window
>>> considerably.
>> The very most important question; does it actually help? What
>> benchmarks, give what numbers?
Here are the numbers from one of the OLTP configurations on an 8-socket
x86 machine:

kernel      txn/minute (normalized)    user/sys
baseline    1.0                        80/5
pcpu        1.021                      84/5

The throughput gain is modest and close to the run-to-run variation.
The schedstat data (added for testing in patch 2/2) indicates that
there are many instances of the race condition being addressed, but
perhaps not enough to produce a significant throughput change.

None of the other benchmarks I tested (TPCC, hackbench, schbench,
swingbench) showed any regression.

I will let Joel post numbers from Android benchmarks.
> I played with something ~similar (cmpxchg() idle cpu reservation) a
> while back in the context of schbench, and it did help that,
I had an atomic version earlier as well. Peter's suggestion of a per-cpu
variable seems to perform slightly better than the atomic version, so
this patch carries the per-cpu version.
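
For reference, the atomic variant was roughly of the following shape
(illustrative sketch only, not the exact code that was benchmarked):

/*
 * Claim the cpu atomically: only the first waker to flip the flag
 * from 0 to 1 gets to reserve the cpu; it is cleared again when the
 * idle task is switched out, as in the per-cpu version.
 */
DEFINE_PER_CPU(int, claim_wakeup);

static bool claim_wakeup_cpu(int cpu)
{
	return cmpxchg(&per_cpu(claim_wakeup, cpu), 0, 1) == 0;
}
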
Do you have the schbench configuration somewhere that I can test? I 
tried various configurations but did not
see any improvement or regression.
> but for
> generic fast mover benchmarks, the added overhead had the expected
> effect, it shaved throughput a wee bit (rob Peter, pay Paul, repeat).
Which benchmark? Is it hackbench or something else? I have not found
any regression in my testing so far and would be happy to test any
other benchmark, or a different hackbench configuration.

Regards,
Atish
> I still have the patch lying about in my rubbish heap, but didn't
> bother to save any of the test results.
>
> 	-Mike
>
>


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-01  6:08       ` Atish Patra
@ 2017-11-01  6:54         ` Mike Galbraith
  2017-11-01  7:18           ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-01  6:54 UTC (permalink / raw)
  To: Atish Patra, Peter Zijlstra
  Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo

On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote:
> 
> On 10/31/2017 03:48 AM, Mike Galbraith wrote:
> 
> > I played with something ~similar (cmpxchg() idle cpu reservation)
> I had an atomic version earlier as well. Peter's suggestion for per cpu 
> seems to perform slightly better than atomic.

Yeah, no doubt about that.

> Thus, this patch has the per cpu version.
> >   a
> > while back in the context of schbench, and it did help that,
> Do you have the schbench configuration somewhere that I can test? I 
> tried various configurations but did not
> see any improvement or regression.

No, as noted, I didn't save anything.  I watched and fiddled with
several configurations on various sized boxen, doing this/that on top
of the reservation thing to try to improve the "it's 100% about perfect
spread" schbench, but didn't like the pricetag, so tossed the lot over
my shoulder and walked away.

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-01  6:54         ` Mike Galbraith
@ 2017-11-01  7:18           ` Mike Galbraith
  2017-11-01 16:36             ` Atish Patra
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-01  7:18 UTC (permalink / raw)
  To: Atish Patra, Peter Zijlstra
  Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo

On Wed, 2017-11-01 at 07:54 +0100, Mike Galbraith wrote:
> On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote:
> 
> > Do you have the schbench configuration somewhere that I can test? I 
> > tried various configurations but did not
> > see any improvement or regression.
> 
> No, as noted, I didn't save anything.  I watched and fiddled with
> several configurations on various sized boxen, doing this/that on top
> of the reservation thing to try to improve the "it's 100% about perfect
> spread" schbench, but didn't like the pricetag, so tossed the lot over
> my shoulder and walked away.

BTW, the biggest trouble with schbench at that time was not so much
about wakeup time stacking (though it was visible), it was using load
averages instead of instantaneous state, and sawtoothing therein.

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-01  7:18           ` Mike Galbraith
@ 2017-11-01 16:36             ` Atish Patra
  2017-11-01 20:20               ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-11-01 16:36 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra
  Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo



On 11/01/2017 02:18 AM, Mike Galbraith wrote:
> On Wed, 2017-11-01 at 07:54 +0100, Mike Galbraith wrote:
>> On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote:
>>
>>> Do you have the schbench configuration somewhere that I can test? I
>>> tried various configurations but did not
>>> see any improvement or regression.
>> No, as noted, I didn't save anything.  I watched and fiddled with
>> several configurations on various sized boxen, doing this/that on top
>> of the reservation thing to try to improve the "it's 100% about perfect
>> spread" schbench, but didn't like the pricetag, so tossed the lot over
>> my shoulder and walked away.
:)
> BTW, the biggest trouble with schbench at that time was not so much
> about wakeup time stacking (though it was visible), it was using load
> averages instead of instantaneous state, and sawtoothing therein.
Ah, I see. Any other benchmark suggestions that might benefit from this fix?

Regards,
Atish
> 	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-01 16:36             ` Atish Patra
@ 2017-11-01 20:20               ` Mike Galbraith
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-01 20:20 UTC (permalink / raw)
  To: Atish Patra, Peter Zijlstra
  Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo

On Wed, 2017-11-01 at 11:36 -0500, Atish Patra wrote:
> 
> Any other benchmark suggestion that may benefit from this fix ?

Nope.  I'll be kinda surprised if you find anything where you don't
have to squint to see a dinky signal modulating white noise :)

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-10-31  8:20   ` Peter Zijlstra
  2017-10-31  8:48     ` Mike Galbraith
@ 2017-11-05  0:58     ` Joel Fernandes
  2017-11-22  5:23       ` Atish Patra
  1 sibling, 1 reply; 25+ messages in thread
From: Joel Fernandes @ 2017-11-05  0:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Atish Patra, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

Hi Peter,

On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>> Currently, multiple tasks can wakeup on same cpu from
>> select_idle_sibiling() path in case they wakeup simulatenously
>> and last ran on the same llc. This happens because an idle cpu
>> is not updated until idle task is scheduled out. Any task waking
>> during that period may potentially select that cpu for a wakeup
>> candidate.
>>
>> Introduce a per cpu variable that is set as soon as a cpu is
>> selected for wakeup for any task. This prevents from other tasks
>> to select the same cpu again. Note: This does not close the race
>> window but minimizes it to accessing the per-cpu variable. If two
>> wakee tasks access the per cpu variable at the same time, they may
>> select the same cpu again. But it minimizes the race window
>> considerably.
>
> The very most important question; does it actually help? What
> benchmarks, give what numbers?

I collected some numbers with an Android benchmark called Jankbench.
Most tests didn't show an improvement or degradation with the patch.
However, one of the tests, called "list view", consistently shows an
improvement. Particularly striking is the improvement at the mean and
the 25th percentile.

For the list_view test, Jankbench pulls up a list of text and scrolls
the list; this exercises the Android display pipeline to render and
display the animation as the scroll happens. For Android, low frame
times are considered quite important, as that means we are less likely
to drop frames and more likely to give the user a good experience
rather than a perceivably poor one.

For each frame, Jankbench measures the total time a frame takes and
stores it in a DB (the time from which the app starts drawing, to when
the rendering completes and the frame is submitted for display).
Following is the distribution of frame times in ms.

count    16304   (@60 fps, 4.5 minutes)

        Without patch   With patch
mean         5.196633   4.429641 (+14.75%)
std          2.030054   2.310025
25%          5.606810   1.991017 (+64.48%)
50%          5.824013   5.716631 (+1.84%)
75%          5.987102   5.932751 (+0.90%)
95%          6.461230   6.301318 (+2.47%)
99%          9.828959   9.697076 (+1.34%)

Note that although Android uses energy aware scheduling patches, I
turned those off to bring the test as close to mainline as possible. I
also backported Vincent's and Brendan's slow path fixes to the 4.4
kernel that the Pixel 2 uses.

Personally I am in favor of this patch, considering this test data and
also that, in the past, our teams had to deal with the same race issue
and used cpusets to avoid it (although they probably tested with
"energy aware" CPU selection kept on).

thanks,

- Joel


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-05  0:58     ` Joel Fernandes
@ 2017-11-22  5:23       ` Atish Patra
  2017-11-23 10:52         ` Uladzislau Rezki
  0 siblings, 1 reply; 25+ messages in thread
From: Atish Patra @ 2017-11-22  5:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

Here are the results of schbench (scheduler latency benchmark) and
uperf (networking benchmark).

Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
schbench config: message threads = 2; time = 180s; worker threads = variable
uperf config: ping-pong test on the loopback interface with message size = 8k

Overall, both benchmarks seem to be happiest when the number of threads
is close to the number of cpus.
--------------------------------------------------------------------------------------------------------------------------
schbench Maximum Latency (lower is better):

            Base(4.14)               Base+pcpu
Workers     Mean        Stdev        Mean        Stdev       Improvement (%)
10          3026.8      4987.12      523         474.35      82.72
18          13854.6     1841.61      12945.6     125.19      6.56
19          16457       2046.51      12985.4     48.46       21.09
20          14995       2368.84      15838       2038.82     -5.62
25          29952.2     107.72       29673.6     337.57      0.93
30          30084       19.768       30096.2     7.782       -0.04

-------------------------------------------------------------------------------------------------------------------

The proposed fix seems to improve the maximum latency for lower numbers
of threads. It also seems to reduce the variation (lower stdev).

If the number of threads is equal to or higher than the number of cpus,
latencies are significantly higher because of the nature of the
benchmark. Results for the higher thread counts are presented to give a
complete picture, but it is difficult to conclude anything from them.

Next, individual percentile results are presented for each use case.
The proposed fix also improves latency across all percentiles for the
configuration (19 worker threads) that should saturate the system.
---------------------------------------------------------------------------------------------------------------------
schbench Latency in usec (lower is better)

            Baseline(4.14)           Base+pcpu
Workers     Mean        stdev        Mean        stdev       Improvement (%)

50th
10          64.2        2.039        63.6        1.743       0.934
18          57.6        5.388        57          4.939       1.041
19          63          4.774        58          4           7.936
20          59.6        4.127        60.2        5.153       -1.006
25          78.4        0.489        78.2        0.748       0.255
30          96.2        0.748        96.4        1.019       -0.207

75th
10          72          3.033        71.6        2.939       0.555
18          78          2.097        77.2        2.135       1.025
19          81.6        1.2          79.4        0.8         2.696
20          81          1.264        80.4        2.332       0.740
25          109.6       1.019        110         0           -0.364
30          781.4       50.902       731.8       70.6382     6.3475

90th
10          80.4        3.666        80.6        2.576       -0.248
18          87.8        1.469        88          1.673       -0.227
19          92.8        0.979        90.6        0.489       2.370
20          92.6        1.019        92          2           0.647
25          8977.6      1277.160     9014.4      467.857     -0.409
30          9558.4      334.641      9507.2      320.383     0.5356

95th
10          86.8        3.867        87.6        4.409       -0.921
18          95.4        1.496        95.2        2.039       0.209
19          102.6       1.624        99          0.894       3.508
20          103.2       1.326        102.2       2.481       0.968
25          12400       78.383       12406.4     37.318      -0.051
30          12336       40.477       12310.4     12.8        0.207

99th
10          99.2        5.418        103.4       6.887       -4.233
18          115.2       2.561        114.6       3.611       0.5208
19          126.25      4.573        120.4       3.872       4.6336
20          145.4       3.09         133         1.41        8.5281
25          12988.8     15.676       12924.8     25.6        0.4927
30          12988.8     15.676       12956.8     32.633      0.2463

99.50th
10          104.4       5.161        109.8       7.909       -5.172
18          127.6       7.391        124.2       4.214       2.6645
19          2712.2      4772.883     133.6       5.571       95.074
20          3707.8      2831.954     2844.2      4708.345    23.291
25          14032       1283.834     13008       0           7.2976
30          16550.4     886.382      13840       1218.355    16.376
------------------------------------------------------------------------------------------------------------------------ 


Results from uperf
uperf config: loopback ping-pong test with message size = 8k

            Baseline(4.14)           Baseline+pcpu
Threads     Mean        stdev        Mean        stdev       Improvement (%)
1           9.056       0.02         8.966       0.083       -0.993
2           17.664      0.13         17.448      0.303       -1.222
4           32.03       0.22         31.972      0.129       -0.181
8           58.198      0.31         58.588      0.198       0.670
16          101.018     0.67         100.056     0.455       -0.952
32          148.1       15.41        164.494     2.312       11.069
64          203.66      1.16         203.042     1.348       -0.3073
128         197.12      1.04         194.722     1.174       -1.2165

The race window fix seems to help uperf at 32 threads (closest to the
number of cpus) as well.

Regards,
Atish


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-22  5:23       ` Atish Patra
@ 2017-11-23 10:52         ` Uladzislau Rezki
  2017-11-23 13:13           ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-23 10:52 UTC (permalink / raw)
  To: Atish Patra, Peter Zijlstra
  Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

Hello, Atish, Peter, all.

I have a question about the case where a task's nr_cpus_allowed is 1.
In that scenario we do not go through select_task_rq_fair(). Therefore,
even though a task "p" is placed on an idle CPU, that CPU will not be
marked as claimed for wake-up.

What do you think about setting per_cpu(claim_wakeup, cpu) = 1 in
select_task_rq() instead, and possibly getting rid of the other call
sites (it increases the race window a bit)?

Also, it would improve the ILB decision for NO_HZ idle balancing. To be
more clear, I mean:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8111cfa20075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
+	per_cpu(claim_wakeup, cpu) = 1;
 	return cpu;
 }

@@ -3888,6 +3889,7 @@ int task_prio(const struct task_struct *p)
    return p->prio - MAX_RT_PRIO;
 }

+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3909,6 +3911,9 @@ int idle_cpu(int cpu)
        return 0;
 #endif

+   if (per_cpu(claim_wakeup, cpu))
+       return 0;
+
    return 1;
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..55ad7fa97e95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,7 +8596,17 @@ static struct {

 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	cpumask_t cpumask;
+	int ilb;
+
+	cpumask_copy(&cpumask, nohz.idle_cpus_mask);
+
+	/*
+	 * Probably not a good approach in case of many CPUs
+	 */
+	for_each_cpu(ilb, &cpumask)
+		if (!per_cpu(claim_wakeup, ilb))
+			break;
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..91cf50cab836 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
        put_prev_task(rq, prev);
	update_idle_core(rq);
+       this_cpu_write(claim_wakeup, 0);
        schedstat_inc(rq->sched_goidle);
        return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..0e0bb7536725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
    
 struct sched_group_capacity {
        atomic_t ref;

--
Uladzislau Rezki


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-23 10:52         ` Uladzislau Rezki
@ 2017-11-23 13:13           ` Mike Galbraith
  2017-11-23 16:00             ` Josef Bacik
  2017-11-24 10:26             ` Uladzislau Rezki
  0 siblings, 2 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-23 13:13 UTC (permalink / raw)
  To: Uladzislau Rezki, Atish Patra, Peter Zijlstra
  Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
> Hello, Atish, Peter, all.
> 
> I have a question about if a task's nr_cpus_allowed is 1.
> In that scenario we do not call select_task_rq. Therefore
> even thought a task "p" is placed on idle CPU that CPU
> will not be marked as claimed for wake-up.
> 
> What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
> to select_task_rq() instead and possibly get rid of them from
> other places (increases a race window a bit)?

My thoughts on all of this is that we need less SIS, not more.  Rather
than trying so hard for the absolute lowest wakeup latency, which
induces throughput/efficiency robbing bouncing, I think we'd be better
off considering leaving an already llc affine task where it is if the
average cycle time is sufficiently low that it will likely hit the CPU
RSN.  Completely ignoring low utilization kernel threads would go a
long way to getting rid of bouncing userspace (which tends to have a
meaningful footprint), all over hell and creation.

You could also periodically send mobile kthreads down the slow path to
try to keep them the hell away from partially busy CPUs, as well as
anything else that hasn't run for a while, to keep background cruft
from continually injecting itself into the middle of a cross core
cyber-sex.

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-23 13:13           ` Mike Galbraith
@ 2017-11-23 16:00             ` Josef Bacik
  2017-11-23 17:40               ` Mike Galbraith
  2017-11-23 21:11               ` Atish Patra
  2017-11-24 10:26             ` Uladzislau Rezki
  1 sibling, 2 replies; 25+ messages in thread
From: Josef Bacik @ 2017-11-23 16:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
	LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote:
> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
> > Hello, Atish, Peter, all.
> > 
> > I have a question about if a task's nr_cpus_allowed is 1.
> > In that scenario we do not call select_task_rq. Therefore
> > even thought a task "p" is placed on idle CPU that CPU
> > will not be marked as claimed for wake-up.
> > 
> > What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
> > to select_task_rq() instead and possibly get rid of them from
> > other places (increases a race window a bit)?
> 
> My thoughts on all of this is that we need less SIS, not more.  Rather
> than trying so hard for the absolute lowest wakeup latency, which
> induces throughput/efficiency robbing bouncing, I think we'd be better
> of considering leaving an already llc affine task where it is if the
> average cycle time is sufficiently low that it will likely hit the CPU
> RSN.  Completely ignoring low utilization kernel threads would go a
> long way to getting rid of bouncing userspace (which tends to have a
> meaningful footprint), all over hell and creation.
> 
> You could also periodically send mobile kthreads down the slow path to
> try to keep them the hell away from partially busy CPUs, as well as
> anything else that hasn't run for a while, to keep background cruft
> from continually injecting itself into the middle of a cross core
> cyber-sex.
>

And on this thanksgiving I'm thankful for Mike, and his entertaining early
morning emails.

Josef 


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-23 16:00             ` Josef Bacik
@ 2017-11-23 17:40               ` Mike Galbraith
  2017-11-23 21:11               ` Atish Patra
  1 sibling, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-23 17:40 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
	LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Thu, 2017-11-23 at 11:00 -0500, Josef Bacik wrote:
> 
> And on this thanksgiving I'm thankful for Mike, and his entertaining early
> morning emails.

Read it again tomorrow.

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-23 16:00             ` Josef Bacik
  2017-11-23 17:40               ` Mike Galbraith
@ 2017-11-23 21:11               ` Atish Patra
  1 sibling, 0 replies; 25+ messages in thread
From: Atish Patra @ 2017-11-23 21:11 UTC (permalink / raw)
  To: Josef Bacik, Mike Galbraith
  Cc: Uladzislau Rezki, Peter Zijlstra, Joel Fernandes, LKML,
	Brendan Jackman, Josef Bacik, Ingo Molnar



On 2017/11/23 10:00 AM, Josef Bacik wrote:
> On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote:
>> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
>>> Hello, Atish, Peter, all.
>>>
>>> I have a question about if a task's nr_cpus_allowed is 1.
>>> In that scenario we do not call select_task_rq. Therefore
>>> even thought a task "p" is placed on idle CPU that CPU
>>> will not be marked as claimed for wake-up.
>>>
>>> What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
>>> to select_task_rq() instead and possibly get rid of them from
>>> other places (increases a race window a bit)?
>> My thoughts on all of this is that we need less SIS, not more.  Rather
>> than trying so hard for the absolute lowest wakeup latency, which
>> induces throughput/efficiency robbing bouncing, I think we'd be better
>> of considering leaving an already llc affine task where it is if the
>> average cycle time is sufficiently low that it will likely hit the CPU
>> RSN.  Completely ignoring low utilization kernel threads would go a
>> long way to getting rid of bouncing userspace (which tends to have a
>> meaningful footprint), all over hell and creation.
>>
>> You could also periodically send mobile kthreads down the slow path to
>> try to keep them the hell away from partially busy CPUs, as well as
>> anything else that hasn't run for a while, to keep background cruft
>> from continually injecting itself into the middle of a cross core
>> cyber-sex.
>>
> And on this thanksgiving I'm thankful for Mike, and his entertaining early
> morning emails.
:) :).
> Josef


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-23 13:13           ` Mike Galbraith
  2017-11-23 16:00             ` Josef Bacik
@ 2017-11-24 10:26             ` Uladzislau Rezki
  2017-11-24 18:46               ` Mike Galbraith
  1 sibling, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-24 10:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
	LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote:
> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote:
> > Hello, Atish, Peter, all.
> > 
> > I have a question about if a task's nr_cpus_allowed is 1.
> > In that scenario we do not call select_task_rq. Therefore
> > even thought a task "p" is placed on idle CPU that CPU
> > will not be marked as claimed for wake-up.
> > 
> > What do you think about adding per_cpu(claim_wakeup, cpu) = 1;
> > to select_task_rq() instead and possibly get rid of them from
> > other places (increases a race window a bit)?
> 
> My thoughts on all of this is that we need less SIS, not more.  Rather
> than trying so hard for the absolute lowest wakeup latency, which
> induces throughput/efficiency robbing bouncing, I think we'd be better
> of considering leaving an already llc affine task where it is if the
> average cycle time is sufficiently low that it will likely hit the CPU
> RSN.
I guess there is a misunderstanding here. The main goal is not to cover
the pinned case, for sure. I was thinking more about the points below:

- Extend the claim_wakeup logic to make the ILB/NO_HZ decision more
  predictable (which is good for mobile workloads). As it stands,
  find_new_ilb() simply returns the first CPU in the "nohz" mask; if we
  know that CPU has been claimed, I think it is worth going with another
  ILB core, since waking up on a remote CPU and then doing
  nohz_idle_balance() there does not improve wake-up latency and is a
  miss from the ILB point of view.

- Get rid of duplication.

- Not be limited to the pinned case.

If you have a proposal, I would appreciate it if you could share your
specific view.

> Completely ignoring low utilization kernel threads would go a
> long way to getting rid of bouncing userspace (which tends to have a
> meaningful footprint), all over hell and creation.
> 
> You could also periodically send mobile kthreads down the slow path to
> try to keep them the hell away from partially busy CPUs, as well as
> anything else that hasn't run for a while, to keep background cruft
> from continually injecting itself into the middle of a cross core
> cyber-sex.

A CPU is not considered idle (in terms of idle_cpu()) unless it first
passes the check that its current task is the idle thread. If we get as
far as the last check while claim_wakeup is set, it means the CPU is
about to switch from the idle state to a non-idle one, and that will
happen quite soon depending on the wake-up latency.

Considering a core as not idle when somebody is about to wake a task on
it is a good point. If you have a specific example where it is bad,
please share it.

--
Uladzislau Rezki


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-24 10:26             ` Uladzislau Rezki
@ 2017-11-24 18:46               ` Mike Galbraith
  2017-11-26 20:58                 ` Mike Galbraith
  2017-11-28  9:34                 ` Uladzislau Rezki
  0 siblings, 2 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-24 18:46 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
	Brendan Jackman, Josef Bacik, Ingo Molnar

On Fri, 2017-11-24 at 11:26 +0100, Uladzislau Rezki wrote:
> 
> I guess there is misunderstanding here. The main goal is not to cover
> pinned case, for sure. I was thinking more about below points:
> 
> - Extend a claim_wake_up logic for making an ILB/NO_HZ decision more
>   predictable (that is good for mobile workloads). Because as it is
>   right now it simply returns a first CPU in a "nohz" mask and if we
>   know that CPU has been claimed i think it is worth to go with another
>   ILB core, since waking up on a remote CPU + doing nohz_idle_balance
>   does not improve wake-up latency and is a miss from ilb point of view.

Using unsynchronized access of a flag set by chaotic events all over
the box?

> If you have any proposal, i would be appreciated if you could
> share your specific view.

My view is you're barking up the wrong tree: you're making the idle
data SIS is using more accurate, but I question the benefit.  That it
makes an imperfect placement decision occasionally due to raciness is
nearly meaningless compared to the cost of frequent bounce.  Better
(but still not perfect, that requires atomics) state information
doesn't meaningfully improve decision quality.

> Considering a core as not-idle when somebody tends to wake up a task on
> it is a good point. If you have any specific example when it is bad,
> please share it.

I'm not sure how that will work out.  Probably like most everything wrt
SIS, first comes "woohoo" then "aw crap", and off to the trash it goes.

	-Mike


* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-24 18:46               ` Mike Galbraith
@ 2017-11-26 20:58                 ` Mike Galbraith
  2017-11-28  9:34                 ` Uladzislau Rezki
  1 sibling, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2017-11-26 20:58 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
	Brendan Jackman, Josef Bacik, Ingo Molnar

On Fri, 2017-11-24 at 19:46 +0100, Mike Galbraith wrote:
> 
> My view is you're barking up the wrong tree: you're making the idle
> data SIS is using more accurate, but I question the benefit.  That it
> makes an imperfect placement decision occasionally due to raciness is
> nearly meaningless compared to the cost of frequent bounce.

Playing with SIS (yet again), below is a hack that I think illustrates
why I think the occasional races are nearly meaningless.

Box = i4790 desktop.

                       master   masterx
TCP_STREAM-1      Avg:  70495     71295
TCP_STREAM-2      Avg:  54388     66202
TCP_STREAM-4      Avg:  19316     21413
TCP_STREAM-8      Avg:   9678      8894
TCP_STREAM-16     Avg:   4286      4360
TCP_MAERTS-1      Avg:  69238     71799
TCP_MAERTS-2      Avg:  50729     65612
TCP_MAERTS-4      Avg:  19095     21984
TCP_MAERTS-8      Avg:   9405      8888
TCP_MAERTS-16     Avg:   4891      4371
TCP_RR-1          Avg: 198617    203291
TCP_RR-2          Avg: 152862    191761
TCP_RR-4          Avg: 112241    117888
TCP_RR-8          Avg: 104453    113260
TCP_RR-16         Avg:  50897     55280
UDP_RR-1          Avg: 250738    264214
UDP_RR-2          Avg: 196250    253352
UDP_RR-4          Avg: 152862    158819
UDP_RR-8          Avg: 143781    154071
UDP_RR-16         Avg:  68605     76492


tbench     1       2       4       8      16
master   772    1207    1829    3516    3440
masterx  811    1466    1959    3737    3670

hackbench -l 10000
5.917   5.990   5.957  avg 5.954   NO_SIS_DEBOUNCE
5.886   5.808   5.826  avg 5.840   SIS_DEBOUNCE


echo 0 > tracing_on
echo 1 > events/sched/sched_migrate_task/enable
start endless tbench 2

for i in `seq 3`
do
	echo > trace
	echo 1 > tracing_on
	sleep 10
	echo 0 > tracing_on
	cat trace|grep tbench|wc -l
done

kde desktop idling

NO_SIS_DEBOUNCE    261     208     199
SIS_DEBOUNCE         8       6       0

add firefox playing youtube documentary

NO_SIS_DEBOUNCE  10906   10094   10774
SIS_DEBOUNCE        34      34      34

tbench 2 throughput as firefox runs

NO_SIS_DEBOUNCE  1129.63 MB/sec
SIS_DEBOUNCE     1462.53 MB/sec

Advisory: welding goggles.

---
 include/linux/sched.h   |    3 ++-
 kernel/sched/fair.c     |   41 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |    1 +
 3 files changed, 42 insertions(+), 3 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -541,7 +541,6 @@ struct task_struct {
 	unsigned int			ptrace;
 
 #ifdef CONFIG_SMP
-	struct llist_node		wake_entry;
 	int				on_cpu;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */
@@ -549,8 +548,10 @@ struct task_struct {
 #endif
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
+	unsigned long			wakee_placed;
 	struct task_struct		*last_wakee;
 
+	struct llist_node		wake_entry;
 	int				wake_cpu;
 #endif
 	int				on_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6174,6 +6174,9 @@ static int select_idle_sibling(struct ta
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (sched_feat(SIS_DEBOUNCE))
+		p->wakee_placed = jiffies;
+
 	return target;
 }
 
@@ -6258,6 +6261,22 @@ static int wake_cap(struct task_struct *
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
+static bool task_placed(struct task_struct *p)
+{
+	return p->wakee_placed == jiffies;
+}
+
+static bool task_llc_affine_and_cold(struct task_struct *p, int cpu, int prev)
+{
+	int cold = sysctl_sched_migration_cost;
+
+	if (!cpus_share_cache(cpu, prev))
+		return false;
+	if (cold > 0 && rq_clock_task(cpu_rq(prev)) - p->se.exec_start > cold)
+		return true;
+	return false;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6276,16 +6295,26 @@ select_task_rq_fair(struct task_struct *
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_debounce = 0;
 	int sync = wake_flags & WF_SYNC;
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_debounce = sched_feat(SIS_DEBOUNCE);
+		if (task_placed(p))
+			goto out_unlock;
+		/* Balance cold tasks to reduce hot task bounce tendency. */
+		if (want_debounce && task_llc_affine_and_cold(p, cpu, prev_cpu)) {
+			sd_flag |= SD_SHARE_PKG_RESOURCES;
+			sd = highest_flag_domain(prev_cpu, SD_SHARE_PKG_RESOURCES);
+			p->wakee_placed = jiffies;
+			goto pick_cpu_cold;
+		}
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6315,6 +6344,7 @@ select_task_rq_fair(struct task_struct *
 			new_cpu = cpu;
 	}
 
+pick_cpu_cold:
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {
 		/*
 		 * We're going to need the task's util for capacity_spare_wake
@@ -6329,9 +6359,13 @@ select_task_rq_fair(struct task_struct *
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
+		if (want_debounce && new_cpu == prev_cpu)
+			p->wakee_placed = jiffies;
+
 	} else {
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	}
+out_unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
@@ -6952,6 +6986,9 @@ static int task_hot(struct task_struct *
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
+	if (task_placed(p))
+		return 1;
+
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_DEBOUNCE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-24 18:46               ` Mike Galbraith
  2017-11-26 20:58                 ` Mike Galbraith
@ 2017-11-28  9:34                 ` Uladzislau Rezki
  2017-11-28 10:49                   ` Mike Galbraith
  1 sibling, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-28  9:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
	LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> On Fri, 2017-11-24 at 11:26 +0100, Uladzislau Rezki wrote:
> > 
> > I guess there is misunderstanding here. The main goal is not to cover
> > pinned case, for sure. I was thinking more about below points:
> > 
> > - Extend a claim_wake_up logic for making an ILB/NO_HZ decision more
> >   predictable (that is good for mobile workloads). Because as it is
> >   right now it simply returns a first CPU in a "nohz" mask and if we
> >   know that CPU has been claimed i think it is worth to go with another
> >   ILB core, since waking up on a remote CPU + doing nohz_idle_balance
> >   does not improve wake-up latency and is a miss from ilb point of view.
> 
> Using unsynchronized access of a flag set by chaotic events all over
> the box?
Well, entering NO_HZ mode and checking nohz.idle_cpus_mask in
nohz_balancer_kick() -> find_new_ilb() are obviously asynchronous
events and, as you say, chaotic, except for the access to the cpumask
variable itself (which is atomic). So I do not see any issue with
that.

> 
> > If you have any proposal, i would be appreciated if you could
> > share your specific view.
> 
> My view is you're barking up the wrong tree: you're making the idle
> data SIS is using more accurate, but I question the benefit.  That it
> makes an imperfect placement decision occasionally due to raciness is
> nearly meaningless compared to the cost of frequent bounce.
Before sitting down and starting to test, I just illustrated how we
could apply claim_wake_up to the ILB, asking the community for its
view on it: drawbacks, pros/cons, proposals, etc.

>
> Better (but still not perfect, that requires atomics) state information
> doesn't meaningfully improve decision quality.
> 
I agree, it should be atomic, something like below:

core.c
+/* cpus with claim wake-up set bit */
+cpumask_t cpu_claim_wake_up_map;
...
@@ -1559,6 +1562,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
                     !cpu_online(cpu)))
                cpu = select_fallback_rq(task_cpu(p), p);

+       cpumask_test_and_set_cpu(cpu, &cpu_claim_wake_up_map);
        return cpu;
}
...
+       if (cpumask_test_cpu(cpu, &cpu_claim_wake_up_map))
+               return 0;
+
        return 1;
}

fair.c:
static inline int find_new_ilb(void)
{
-       int ilb = cpumask_first(nohz.idle_cpus_mask);
+       cpumask_t cpumask;
+       int ilb;
+
+       cpumask_andnot(&cpumask, nohz.idle_cpus_mask,
+                       &cpu_claim_wake_up_map);
+
+       ilb = cpumask_first(&cpumask);
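
To be explicit about what the fragment aims at, here is a minimal
self-contained sketch of the resulting function (the surrounding
lines roughly mirror the current find_new_ilb(); the on-stack
cpumask_t is purely illustrative, a real patch would want to avoid
it for large NR_CPUS):

	static inline int find_new_ilb(void)
	{
		cpumask_t cpumask;
		int ilb;

		/* Skip CPUs already claimed for a pending wakeup. */
		cpumask_andnot(&cpumask, nohz.idle_cpus_mask,
			       &cpu_claim_wake_up_map);

		ilb = cpumask_first(&cpumask);
		if (ilb < nr_cpu_ids && idle_cpu(ilb))
			return ilb;

		return nr_cpu_ids;
	}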

> > Considering a core as not-idle when somebody tends to wake up a task on
> > it is a good point. If you have any specific example when it is bad,
> > please share it.
> 
> I'm not sure how that will work out.  Probably like most everything wrt
> SIS, first comes "woohoo" then "aw crap", and off to the trash it goes.
> 

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-28  9:34                 ` Uladzislau Rezki
@ 2017-11-28 10:49                   ` Mike Galbraith
  2017-11-29 10:41                     ` Uladzislau Rezki
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-28 10:49 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
	Brendan Jackman, Josef Bacik, Ingo Molnar

On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> 
> > My view is you're barking up the wrong tree: you're making the idle
> > data SIS is using more accurate, but I question the benefit.  That it
> > makes an imperfect placement decision occasionally due to raciness is
> > nearly meaningless compared to the cost of frequent bounce.

> Before sitting down and start testing, i just illustrated how we can
> apply claim_wake_up to ilb asking community a specific view on it:
> drawbacks, pros/cons, proposals etc.

Even if you make the thing atomic, what is ILB supposed to do, look
over its shoulder every step of the way and sh*t its pants if somebody
touches claim_wake_up as it's about to or just after it did something?

If you intend to make all of LB race free, good luck.

	-Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-28 10:49                   ` Mike Galbraith
@ 2017-11-29 10:41                     ` Uladzislau Rezki
  2017-11-29 18:15                       ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-29 10:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
	LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > 
> > > My view is you're barking up the wrong tree: you're making the idle
> > > data SIS is using more accurate, but I question the benefit.  That it
> > > makes an imperfect placement decision occasionally due to raciness is
> > > nearly meaningless compared to the cost of frequent bounce.
> 
> > Before sitting down and start testing, i just illustrated how we can
> > apply claim_wake_up to ilb asking community a specific view on it:
> > drawbacks, pros/cons, proposals etc.
> 
> Even if you make the thing atomic, what is ILB supposed to do, look
> over its shoulder every step of the way and sh*t it's pants if somebody
> touches claim_wake_up as it's about to or just after it did something?
If nohz.idle_cpus_mask is set for a particular CPU together with the
claim mask, it means that TIF_NEED_RESCHED is coming or is already in
place. When a CPU hits the idle thread, the claim bit gets reset and
the CPU proceeds to NO_HZ mode, unless it runs into scheduler_ipi or
so.
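
To make that reset point concrete, something like the below could do
it (only a sketch; the exact hook is an assumption, and the lines
around the new one are just the existing idle pick path as I remember
it):

	/* kernel/sched/idle_task.c */
	static struct task_struct *
	pick_next_task_idle(struct rq *rq, struct task_struct *prev,
			    struct rq_flags *rf)
	{
		put_prev_task(rq, prev);
		/*
		 * The wakeup that claimed this CPU has either arrived or
		 * been abandoned; drop the claim so the CPU is visible as
		 * idle again, both to idle_cpu() and to find_new_ilb().
		 */
		cpumask_clear_cpu(cpu_of(rq), &cpu_claim_wake_up_map);
		update_idle_core(rq);
		schedstat_inc(rq->sched_goidle);
		return rq->idle;
	}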

> 
> If you intend to make all of LB race free, good luck.
> 
> 	-Mike

Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-29 10:41                     ` Uladzislau Rezki
@ 2017-11-29 18:15                       ` Mike Galbraith
  2017-11-30 12:30                         ` Uladzislau Rezki
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-29 18:15 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
	Brendan Jackman, Josef Bacik, Ingo Molnar

On Wed, 2017-11-29 at 11:41 +0100, Uladzislau Rezki wrote:
> On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > > 
> > > > My view is you're barking up the wrong tree: you're making the idle
> > > > data SIS is using more accurate, but I question the benefit.  That it
> > > > makes an imperfect placement decision occasionally due to raciness is
> > > > nearly meaningless compared to the cost of frequent bounce.
> > 
> > > Before sitting down and start testing, i just illustrated how we can
> > > apply claim_wake_up to ilb asking community a specific view on it:
> > > drawbacks, pros/cons, proposals etc.
> > 
> > Even if you make the thing atomic, what is ILB supposed to do, look
> > over its shoulder every step of the way and sh*t it's pants if somebody
> > touches claim_wake_up as it's about to or just after it did something?
> If nohz.idle_cpus_mask is set for particular CPU together with claim mask,
> it means that TIF_NEED_RESCHED is coming or is already in place. When a
> CPU hits idle_thread a claim bit gets reset and proceed to no_hz mode
> unless it runs into scheduler_ipi or so.

Which means nothing to an LB operation in progress.

But whatever, I'm not going to argue endlessly about something I think
should be blatantly obvious.  IMO, this is a couple points shy of
pointless.  That's my 'C' to this RFC in a nutshell.  I'm done.

	-Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-29 18:15                       ` Mike Galbraith
@ 2017-11-30 12:30                         ` Uladzislau Rezki
  0 siblings, 0 replies; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-30 12:30 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
	LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Wed, Nov 29, 2017 at 07:15:21PM +0100, Mike Galbraith wrote:
> On Wed, 2017-11-29 at 11:41 +0100, Uladzislau Rezki wrote:
> > On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> > > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > > > 
> > > > > My view is you're barking up the wrong tree: you're making the idle
> > > > > data SIS is using more accurate, but I question the benefit.  That it
> > > > > makes an imperfect placement decision occasionally due to raciness is
> > > > > nearly meaningless compared to the cost of frequent bounce.
> > > 
> > > > Before sitting down and start testing, i just illustrated how we can
> > > > apply claim_wake_up to ilb asking community a specific view on it:
> > > > drawbacks, pros/cons, proposals etc.
> > > 
> > > Even if you make the thing atomic, what is ILB supposed to do, look
> > > over its shoulder every step of the way and sh*t it's pants if somebody
> > > touches claim_wake_up as it's about to or just after it did something?
> > If nohz.idle_cpus_mask is set for particular CPU together with claim mask,
> > it means that TIF_NEED_RESCHED is coming or is already in place. When a
> > CPU hits idle_thread a claim bit gets reset and proceed to no_hz mode
> > unless it runs into scheduler_ipi or so.

> 
> Which means nothing to an LB operation in progress.
> 
<snip>
static void nohz_idle_balance(...)
...
    /*
     * If this cpu gets work to do, stop the load balancing
     * work being done for other cpus. Next load
     * balancing owner will pick it up.
     */
    if (need_resched())
        break;
...
<snip>

>
> But whatever, I'm not going to argue endlessly about something I think
> should be blatantly obvious.  IMO, this is a couple points shy of
> pointless.  That's my 'C' to this RFC in a nutshell.  I'm done.
> 
Thank you.

Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-11-30 12:30 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31  5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
2017-10-31  5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
2017-10-31  8:20   ` Peter Zijlstra
2017-10-31  8:48     ` Mike Galbraith
2017-11-01  6:08       ` Atish Patra
2017-11-01  6:54         ` Mike Galbraith
2017-11-01  7:18           ` Mike Galbraith
2017-11-01 16:36             ` Atish Patra
2017-11-01 20:20               ` Mike Galbraith
2017-11-05  0:58     ` Joel Fernandes
2017-11-22  5:23       ` Atish Patra
2017-11-23 10:52         ` Uladzislau Rezki
2017-11-23 13:13           ` Mike Galbraith
2017-11-23 16:00             ` Josef Bacik
2017-11-23 17:40               ` Mike Galbraith
2017-11-23 21:11               ` Atish Patra
2017-11-24 10:26             ` Uladzislau Rezki
2017-11-24 18:46               ` Mike Galbraith
2017-11-26 20:58                 ` Mike Galbraith
2017-11-28  9:34                 ` Uladzislau Rezki
2017-11-28 10:49                   ` Mike Galbraith
2017-11-29 10:41                     ` Uladzislau Rezki
2017-11-29 18:15                       ` Mike Galbraith
2017-11-30 12:30                         ` Uladzislau Rezki
2017-10-31  5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra
