* wake_wide mechanism clarification @ 2017-06-30 0:19 Joel Fernandes 2017-06-30 0:49 ` Josef Bacik 0 siblings, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-06-30 0:19 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Josef Bacik

Dear Mike,

I wanted your kind help to understand your patch "sched: beef up wake_wide()"[1] which is a modification of the original patch from Michael Wang [2]. In particular, I didn't follow the following comment: "to shared cache, we look for a minimum 'flip' frequency of llc_size in one partner, and a factor of lls_size higher frequency in the other."

Why do we want the master's flip frequency to be higher than the slave's by that factor? The code here is written as:

	if (slave < factor || master < slave * factor)
		return 0;

However I think we should just do (with my current and probably wrong understanding):

	if (slave < factor || master < factor)
		return 0;

Basically, I didn't follow why we multiply the slave's flips with llc_size. That makes it sound like the master has to have way more flips than the slave to return 0 from wake_wide. Could you maybe give an example to clarify? Thanks a lot for your help. I am also CC'ing Peter and some ARM folks for the discussion (and also Josef, who was discussing it with Mike on the mailing list a few years ago).

Thanks, Joel

[1] https://patchwork.kernel.org/patch/6787941/
[2] https://lkml.org/lkml/2013/7/4/20

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 0:19 wake_wide mechanism clarification Joel Fernandes @ 2017-06-30 0:49 ` Josef Bacik 2017-06-30 3:04 ` Joel Fernandes ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: Josef Bacik @ 2017-06-30 0:49 UTC (permalink / raw) To: Joel Fernandes Cc: Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath

On Thu, Jun 29, 2017 at 05:19:14PM -0700, Joel Fernandes wrote: > Dear Mike, > > I wanted your kind help to understand your patch "sched: beef up > wake_wide()"[1] which is a modification to the original patch from > Michael Wang [2]. > > In particular, I didn't following the following comment: > " to shared cache, we look for a minimum 'flip' frequency of llc_size > in one partner, and a factor of lls_size higher frequency in the > other." > > Why are wanting the master's flip frequency to be higher than the > slaves by the factor?

(Responding from my personal email as my work email is outlook shit and impossible to use) Because we are trying to detect the case where the master is waking many different processes while the 'slave' processes are only waking up the master/some other specific process; that is the case where we don't care about cache locality.

> > The code here is written as: > > if (slave < factor || master < slave * factor) > return 0; > > However I think we should just do (with my current and probably wrong > understanding): > > if (slave < factor || master < factor) > return 0; >

Actually I think both are wrong, but I need Mike to weigh in. In my example above we'd return 0, because the 'producer' will definitely have a wakee_flip of ridiculous values, but the 'consumer' could essentially have a wakee_flip of 1, just waking the master to tell it that it's done. I _suppose_ in practice you have a lock or something so the wakee_flip isn't going to be strictly 1, but some significantly lower value than master.
I'm skeptical of the slave < factor test; I think it's too high of a bar in the case where cache locality doesn't really matter, but the master < slave * factor makes sense, as slave is going to be orders of magnitude lower than master.

> Basically, I didn't follow why we multiply the slave's flips with > llc_size. That makes it sound like the master has to have way more > flips than the slave to return 0 from wake_wide. Could you maybe give > an example to clarify? Thanks a lot for your help, >

It may be worth trying schedbench and tracing it to see how this turns out in practice, as that's the workload that generated all this discussion before. I imagine generally speaking this works out properly. The small regression I reported before was at low RPS, so we wouldn't be waking up as many tasks as often, so we would be returning 0 from wake_wide() and we'd get screwed. I think possibly dropping the slave < factor part of the test would address that, but I'd have to trace it to say for sure.

Thanks, Josef

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 0:49 ` Josef Bacik @ 2017-06-30 3:04 ` Joel Fernandes 2017-06-30 14:28 ` Josef Bacik 2017-06-30 3:11 ` Mike Galbraith 2017-06-30 13:11 ` Matt Fleming 2 siblings, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-06-30 3:04 UTC (permalink / raw) To: Josef Bacik Cc: Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath

Hi Josef,

Thanks a lot for your reply, :-)

On Thu, Jun 29, 2017 at 5:49 PM, Josef Bacik <josef@toxicpanda.com> wrote: > Because we are trying to detect the case that the master is waking many > different processes, and the 'slave' processes are only waking up the > master/some other specific processes to determine if we don't care about cache > locality. > >> >> The code here is written as: >> >> if (slave < factor || master < slave * factor) >> return 0; >> >> However I think we should just do (with my current and probably wrong >> understanding): >> >> if (slave < factor || master < factor) >> return 0; >> > > Actually I think both are wrong, but I need Mike to weigh in. In my example > above we'd return 0, because the 'producer' will definitely have a wakee_flip of > ridiculous values, but the 'consumer' could essentially have a wakee_flip of 1, > just the master to tell it that it's done. I _suppose_ in practice you have a > lock or something so the wakee_flip isn't going to be strictly 1, but some > significantly lower value than master. I'm skeptical of the slave < factor > test, I think it's too high of a bar in the case where cache locality doesn't > really matter, but the master < slave * factor makes sense, as slave is going to > be orders of magnitude lower than master.

That makes sense that we multiply slave's flips by a factor because it's low, but I still didn't get why the factor is chosen to be llc_size instead of something else for the multiplication with slave (slave * factor).
I know slave's flips are probably low and need to be higher, but why is it multiplied by llc_size rather than some other number (in other words, why are we tying the size of a NUMA node or a cluster size with the number of wake-ups that the slave made)? Is this because of an assumption that a master is expected to evenly distribute work among CPUs within its node or something like that?

Moreover, this case is only for when slave wakeups are far lower than the master's. But what about the case where slave wakeups are greater than the factor but approximately equal or around the same as the master's? Then, it sounds like (master < slave * factor) can return true. In that case wake_wide() will return 0. That sounds like a bad thing to do I think: pull a busy slave onto a busy master.

>> Basically, I didn't follow why we multiply the slave's flips with >> llc_size. That makes it sound like the master has to have way more >> flips than the slave to return 0 from wake_wide. Could you maybe give >> an example to clarify? Thanks a lot for your help, >> > > It may be worth to try with schedbench and trace it to see how this turns out in > practice, as that's the workload that generated all this discussion before. I > imagine generally speaking this works out properly. The small regression I

I see. I will try to find some time to play with this tool. Thanks for the pointer.

> reported before was at low RPS, so we wouldn't be waking up as many tasks as > often, so we would be returning 0 from wake_wide() and we'd get screwed. This > is where I think possibly dropping the slave < factor part of the test would > address that, but I'd have to trace it to say for sure.

Ah, I see your problem. I guess it's a conflicting case/requirement? I guess somehow to handle your case, it has to be embedded into this condition that cache locality doesn't matter, as that seems to be the assumption of slave < factor as you pointed out.

Thanks, Joel

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 3:04 ` Joel Fernandes @ 2017-06-30 14:28 ` Josef Bacik 2017-06-30 17:02 ` Mike Galbraith 0 siblings, 1 reply; 29+ messages in thread From: Josef Bacik @ 2017-06-30 14:28 UTC (permalink / raw) To: Joel Fernandes Cc: Josef Bacik, Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: > Hi Josef, > > Thanks a lot for your reply, :-) > > On Thu, Jun 29, 2017 at 5:49 PM, Josef Bacik <josef@toxicpanda.com> wrote: > > Because we are trying to detect the case that the master is waking many > > different processes, and the 'slave' processes are only waking up the > > master/some other specific processes to determine if we don't care about cache > > locality. > > > >> > >> The code here is written as: > >> > >> if (slave < factor || master < slave * factor) > >> return 0; > >> > >> However I think we should just do (with my current and probably wrong > >> understanding): > >> > >> if (slave < factor || master < factor) > >> return 0; > >> > > > > Actually I think both are wrong, but I need Mike to weigh in. In my example > > above we'd return 0, because the 'producer' will definitely have a wakee_flip of > > ridiculous values, but the 'consumer' could essentially have a wakee_flip of 1, > > just the master to tell it that it's done. I _suppose_ in practice you have a > > lock or something so the wakee_flip isn't going to be strictly 1, but some > > significantly lower value than master. I'm skeptical of the slave < factor > > test, I think it's too high of a bar in the case where cache locality doesn't > > really matter, but the master < slave * factor makes sense, as slave is going to > > be orders of magnitude lower than master. 
> That makes sense that we multiply slave's flips by a factor because > its low, but I still didn't get why the factor is chosen to be > llc_size instead of something else for the multiplication with slave > (slave * factor). I know slave's flips are probably low and need to be > higher, but why its multiplied with llc_size than some other number > (in other words - why we are tying the size of a NUMA node or a > cluster size with the number of wake-ups that the slave made)? Is this > because of an assumption that a master is expected to evenly > distribute work amongs CPUs within its node or something like that? >

Yeah I don't know why llc_size was chosen, but I've been thinking about this on and off since last night and I don't really know what the right thing to use would be. We need some arbitrary number, and any arbitrary number is going to be wrong for one workload when it's fine for another workload. What we really want is to know how many different tasks we could potentially wake up to compare against, and that's kind of impossible. If we did it based solely on actual wakee_flips values we could end up with the case where 1 task that has 2 worker tasks is always wake_wide, and that may not be as helpful. With processes that are threaded then sure we have a readily available number to use, but for things that are discrete processes that talk over say a pipe or shared memory that's going to be unpossible to tease out. I haven't had my Mt. Dew this morning so if you have a better idea for a factor I'm all ears.

> More over, this case is only for when slave wakeups are far lower than > the master. But, what about the case where slave wakes up are greater > than the factor but approximately equal or around the same as the > masters'. Then, it sounds like (master < slave * factor) can return > true. In that case wake_wide() will be = 0. That sounds like a bad > thing to do I think - pull a busy slave onto a busy master.
> This I think is a flaw in our load balancing assumptions. I've been drowning in this code recently because of a cgroups imbalance problem. We don't really propagate load well for processes that are on the rq but haven't run yet, so I feel this plays into this problem where we think the cpu we're pulling to has plenty of capacity when in reality it doesn't. I'm going to tool around with this logic some this morning and see if I can make it a little bit smarter, I'll let you know how it goes.

Thanks, Josef

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 14:28 ` Josef Bacik @ 2017-06-30 17:02 ` Mike Galbraith 2017-06-30 17:55 ` Josef Bacik 2017-07-29 8:01 ` Joel Fernandes 0 siblings, 2 replies; 29+ messages in thread From: Mike Galbraith @ 2017-06-30 17:02 UTC (permalink / raw) To: Josef Bacik, Joel Fernandes Cc: Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath

On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: > > > That makes sense that we multiply slave's flips by a factor because > > its low, but I still didn't get why the factor is chosen to be > > llc_size instead of something else for the multiplication with slave > > (slave * factor). > Yeah I don't know why llc_size was chosen...

static void update_top_cache_domain(int cpu)
{
	struct sched_domain_shared *sds = NULL;
	struct sched_domain *sd;
	int id = cpu;
	int size = 1;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}

	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;

The goal of wake wide was to approximate when pulling would be a futile consolidation effort and counterproductive to scaling. 'course with ever increasing socket size, any 1:N waker is ever more likely to run out of CPU for its one and only self (slamming into scaling wall) before it needs to turn its minions loose to conquer the world.

Something else to consider: network interrupt waking multiple workers at high frequency. If the waking CPU is idle, do you really want to place a worker directly in front of a tattoo artist, or is it better off nearly anywhere but there?

If the box is virtual, with no topology exposed (or real but ancient) to let select_idle_sibling() come to the rescue, two workers can even get tattooed simultaneously (see sync wakeup).
-Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 17:02 ` Mike Galbraith @ 2017-06-30 17:55 ` Josef Bacik 2017-08-03 10:53 ` Brendan Jackman 2017-07-29 8:01 ` Joel Fernandes 1 sibling, 1 reply; 29+ messages in thread From: Josef Bacik @ 2017-06-30 17:55 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: > > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: > > > > > That makes sense that we multiply slave's flips by a factor because > > > its low, but I still didn't get why the factor is chosen to be > > > llc_size instead of something else for the multiplication with slave > > > (slave * factor). > > > Yeah I don't know why llc_size was chosen... > > static void update_top_cache_domain(int cpu) > { > struct sched_domain_shared *sds = NULL; > struct sched_domain *sd; > int id = cpu; > int size = 1; > > sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); > if (sd) { > id = cpumask_first(sched_domain_span(sd)); > size = cpumask_weight(sched_domain_span(sd)); > sds = sd->shared; > } > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); > per_cpu(sd_llc_size, cpu) = size; > > The goal of wake wide was to approximate when pulling would be a futile > consolidation effort and counterproductive to scaling. 'course with > ever increasing socket size, any 1:N waker is ever more likely to run > out of CPU for its one and only self (slamming into scaling wall) > before it needing to turn its minions loose to conquer the world. > > Something else to consider: network interrupt waking multiple workers > at high frequency. If the waking CPU is idle, do you really want to > place a worker directly in front of a tattoo artist, or is it better > off nearly anywhere but there? 
> > If the box is virtual, with no topology exposed (or real but ancient) > to let select_idle_sibling() come to the rescue, two workers can even > get tattooed simultaneously (see sync wakeup). >

Heuristics are hard, news at 11. I think messing with wake_wide() itself is too big of a hammer; we probably need a middle ground. I'm messing with it right now so it's too early to say for sure, but I _suspect_ the bigger latencies we see are not because we overload the cpu we're trying to pull to, but because when we fail to do the wake_affine() we only look at siblings of the affine_sd instead of doing the full "find the idlest cpu in the land!" thing. I _think_ the answer is to make select_idle_sibling() try less hard to find something workable and only use obviously idle CPUs in the affine sd, and fall back to the full load-balance-esque search.

This would make affine misses really expensive, but we can probably negate this by tracking per task how often it misses the target, and use that to adjust when we do wake_affine in the future for that task. Still experimenting some, I just found out a few hours ago I need to rework some of this to fix my cpu imbalance problem with cgroups, so once I get something working I'll throw it your way to take a look.

Thanks, Josef

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 17:55 ` Josef Bacik @ 2017-08-03 10:53 ` Brendan Jackman 2017-08-03 13:15 ` Josef Bacik 0 siblings, 1 reply; 29+ messages in thread From: Brendan Jackman @ 2017-08-03 10:53 UTC (permalink / raw) To: Josef Bacik Cc: Mike Galbraith, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Chris Redpath Hi, On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote: > On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: >> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >> > >> > > That makes sense that we multiply slave's flips by a factor because >> > > its low, but I still didn't get why the factor is chosen to be >> > > llc_size instead of something else for the multiplication with slave >> > > (slave * factor). >> >> > Yeah I don't know why llc_size was chosen... >> >> static void update_top_cache_domain(int cpu) >> { >> struct sched_domain_shared *sds = NULL; >> struct sched_domain *sd; >> int id = cpu; >> int size = 1; >> >> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >> if (sd) { >> id = cpumask_first(sched_domain_span(sd)); >> size = cpumask_weight(sched_domain_span(sd)); >> sds = sd->shared; >> } >> >> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >> per_cpu(sd_llc_size, cpu) = size; >> >> The goal of wake wide was to approximate when pulling would be a futile >> consolidation effort and counterproductive to scaling. 'course with >> ever increasing socket size, any 1:N waker is ever more likely to run >> out of CPU for its one and only self (slamming into scaling wall) >> before it needing to turn its minions loose to conquer the world. >> >> Something else to consider: network interrupt waking multiple workers >> at high frequency. If the waking CPU is idle, do you really want to >> place a worker directly in front of a tattoo artist, or is it better >> off nearly anywhere but there? 
>> >> If the box is virtual, with no topology exposed (or real but ancient) >> to let select_idle_sibling() come to the rescue, two workers can even >> get tattooed simultaneously (see sync wakeup). >> > > Heuristics are hard, news at 11. I think messing with wake_wide() itself is too > big of a hammer, we probably need a middle ground. I'm messing with it right > now so it's too early to say for sure, but i _suspect_ the bigger latencies we > see are not because we overload the cpu we're trying to pull to, but because > when we fail to do the wake_affine() we only look at siblings of the affine_sd > instead of doing the full "find the idlest cpu in the land!" thing.

This is the problem I've been hitting lately. My use case is 1 task per CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1 task per CPU, they all do X amount of work then pthread_barrier_wait (i.e. sleep until the last task finishes its X and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU finish faster, and then those CPUs pull over the tasks that are still running:

 v CPU v            ->time->

            -------------
 0 (big)    11111  /333
            -------------
 1 (big)    22222 /444|
            -------------
 2 (LITTLE) 333333/
            -------------
 3 (LITTLE) 444444/
            -------------

Now when task 4 hits the barrier (at |) and wakes the others up, there are 4 tasks with prev_cpu=<big> and 0 tasks with prev_cpu=<little>.
Assuming that those wakeups happen on CPU4, regardless of wake_affine, want_affine means that we'll only look in sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the bigs until the next load balance, something like this:

 v CPU v            ->time->

            ------------------------
 0 (big)    11111  /333 31313\33333
            ------------------------
 1 (big)    22222 /444|424\4444444
            ------------------------
 2 (LITTLE) 333333/      \222222
            ------------------------
 3 (LITTLE) 444444/      \1111
            ------------------------
                          ^^^
                   underutilization

> I _think_ > the answer is to make select_idle_sibling() try less hard to find something > workable and only use obviously idle cpu's in the affine sd, and fall back to > the full load balance esque search.

So this idea of allowing select_idle_sibling to fail, and falling back to the slow path, would help me too, I think.

This is also why I was playing with your don't-affine-recently-balanced-tasks patch[1], which also helps my case since it prevents want_affine for tasks 3 and 4 (which were recently moved by an active balance).

[1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2 (also linked elsewhere in this thread)

> This would make affine misses really expensive, but we can probably negate this > by tracking per task how often it misses the target, and use that to adjust when > we do wake_affine in the future for that task. Still experimenting some, I just > found out a few hours ago I need to rework some of this to fix my cpu imbalance > problem with cgroups, so once I get something working I'll throw it your way to > take a look. Thanks,

Cheers, Brendan

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-08-03 10:53 ` Brendan Jackman @ 2017-08-03 13:15 ` Josef Bacik 2017-08-03 15:05 ` Brendan Jackman 0 siblings, 1 reply; 29+ messages in thread From: Josef Bacik @ 2017-08-03 13:15 UTC (permalink / raw) To: Brendan Jackman Cc: Josef Bacik, Mike Galbraith, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Chris Redpath On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote: > > Hi, > > On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote: > > On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: > >> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: > >> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: > >> > > >> > > That makes sense that we multiply slave's flips by a factor because > >> > > its low, but I still didn't get why the factor is chosen to be > >> > > llc_size instead of something else for the multiplication with slave > >> > > (slave * factor). > >> > >> > Yeah I don't know why llc_size was chosen... > >> > >> static void update_top_cache_domain(int cpu) > >> { > >> struct sched_domain_shared *sds = NULL; > >> struct sched_domain *sd; > >> int id = cpu; > >> int size = 1; > >> > >> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); > >> if (sd) { > >> id = cpumask_first(sched_domain_span(sd)); > >> size = cpumask_weight(sched_domain_span(sd)); > >> sds = sd->shared; > >> } > >> > >> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); > >> per_cpu(sd_llc_size, cpu) = size; > >> > >> The goal of wake wide was to approximate when pulling would be a futile > >> consolidation effort and counterproductive to scaling. 'course with > >> ever increasing socket size, any 1:N waker is ever more likely to run > >> out of CPU for its one and only self (slamming into scaling wall) > >> before it needing to turn its minions loose to conquer the world. > >> > >> Something else to consider: network interrupt waking multiple workers > >> at high frequency. 
If the waking CPU is idle, do you really want to > >> place a worker directly in front of a tattoo artist, or is it better > >> off nearly anywhere but there? > >> > >> If the box is virtual, with no topology exposed (or real but ancient) > >> to let select_idle_sibling() come to the rescue, two workers can even > >> get tattooed simultaneously (see sync wakeup). > >> > > > > Heuristics are hard, news at 11. I think messing with wake_wide() itself is too > > big of a hammer, we probably need a middle ground. I'm messing with it right > > now so it's too early to say for sure, but i _suspect_ the bigger latencies we > > see are not because we overload the cpu we're trying to pull to, but because > > when we fail to do the wake_affine() we only look at siblings of the affine_sd > > instead of doing the full "find the idlest cpu in the land!" thing. > > This is the problem I've been hitting lately. My use case is 1 task per > CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1 > task per CPU, they all do X amount of work then pthread_barrier_wait > (i.e. sleep until the last task finishes its X and hits the barrier). On > big.LITTLE, the tasks which get a "big" CPU finish faster, and then > those CPUs pull over the tasks that are still running: > > v CPU v ->time-> > > ------------- > 0 (big) 11111 /333 > ------------- > 1 (big) 22222 /444| > ------------- > 2 (LITTLE) 333333/ > ------------- > 3 (LITTLE) 444444/ > ------------- > > Now when task 4 hits the barrier (at |) and wakes the others up, there > are 4 tasks with prev_cpu=<big> and 0 tasks with > prev_cpu=<little>. 
Assuming that those wakeups happen on CPU4, > regardless of wake_affine, want_affine means that we'll only look in > sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the > bigs until the next load balance, something like this: > > v CPU v ->time-> > > ------------------------ > 0 (big) 11111 /333 31313\33333 > ------------------------ > 1 (big) 22222 /444|424\4444444 > ------------------------ > 2 (LITTLE) 333333/ \222222 > ------------------------ > 3 (LITTLE) 444444/ \1111 > ------------------------ > ^^^ > underutilization > > > I _think_ > > the answer is to make select_idle_sibling() try less hard to find something > > workable and only use obviously idle cpu's in the affine sd, and fall back to > > the full load balance esque search. > > So this idea of allowing select_idle_sibling to fail, and falling back > to the slow path, would help me too, I think. Unfortunately this statement of mine was wrong, I had it in my head that we would fall back to a find the idlest cpu thing provided we failed to wake affine, but we just do select_idle_sibling() and expect the load balancer to move things around as needed. > > This is also why I was playing with your > don't-affine-recently-balanced-tasks patch[1], which also helps my case > since it prevents want_affine for tasks 3 and 4 (which were recently > moved by an active balance). > > [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2 > (also linked elsewhere in this thread) > Would you try peter's sched/experimental branch and see how that affects your workload? I'm still messing with my patches and I may drop this one as it now appears to be too aggressive with the new set of patches. Thanks, Josef ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-08-03 13:15 ` Josef Bacik @ 2017-08-03 15:05 ` Brendan Jackman 2017-08-09 21:22 ` Atish Patra 0 siblings, 1 reply; 29+ messages in thread From: Brendan Jackman @ 2017-08-03 15:05 UTC (permalink / raw) To: Josef Bacik Cc: Mike Galbraith, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Chris Redpath On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote: > On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote: >> >> Hi, >> >> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote: >> > On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: >> >> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >> >> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >> >> > >> >> > > That makes sense that we multiply slave's flips by a factor because >> >> > > its low, but I still didn't get why the factor is chosen to be >> >> > > llc_size instead of something else for the multiplication with slave >> >> > > (slave * factor). >> >> >> >> > Yeah I don't know why llc_size was chosen... >> >> >> >> static void update_top_cache_domain(int cpu) >> >> { >> >> struct sched_domain_shared *sds = NULL; >> >> struct sched_domain *sd; >> >> int id = cpu; >> >> int size = 1; >> >> >> >> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >> >> if (sd) { >> >> id = cpumask_first(sched_domain_span(sd)); >> >> size = cpumask_weight(sched_domain_span(sd)); >> >> sds = sd->shared; >> >> } >> >> >> >> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >> >> per_cpu(sd_llc_size, cpu) = size; >> >> >> >> The goal of wake wide was to approximate when pulling would be a futile >> >> consolidation effort and counterproductive to scaling. 'course with >> >> ever increasing socket size, any 1:N waker is ever more likely to run >> >> out of CPU for its one and only self (slamming into scaling wall) >> >> before it needing to turn its minions loose to conquer the world. 
>> >> >> >> Something else to consider: network interrupt waking multiple workers >> >> at high frequency. If the waking CPU is idle, do you really want to >> >> place a worker directly in front of a tattoo artist, or is it better >> >> off nearly anywhere but there? >> >> >> >> If the box is virtual, with no topology exposed (or real but ancient) >> >> to let select_idle_sibling() come to the rescue, two workers can even >> >> get tattooed simultaneously (see sync wakeup). >> >> >> > >> > Heuristics are hard, news at 11. I think messing with wake_wide() itself is too >> > big of a hammer, we probably need a middle ground. I'm messing with it right >> > now so it's too early to say for sure, but i _suspect_ the bigger latencies we >> > see are not because we overload the cpu we're trying to pull to, but because >> > when we fail to do the wake_affine() we only look at siblings of the affine_sd >> > instead of doing the full "find the idlest cpu in the land!" thing. >> >> This is the problem I've been hitting lately. My use case is 1 task per >> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1 >> task per CPU, they all do X amount of work then pthread_barrier_wait >> (i.e. sleep until the last task finishes its X and hits the barrier). On >> big.LITTLE, the tasks which get a "big" CPU finish faster, and then >> those CPUs pull over the tasks that are still running: >> >> v CPU v ->time-> >> >> ------------- >> 0 (big) 11111 /333 >> ------------- >> 1 (big) 22222 /444| >> ------------- >> 2 (LITTLE) 333333/ >> ------------- >> 3 (LITTLE) 444444/ >> ------------- >> >> Now when task 4 hits the barrier (at |) and wakes the others up, there >> are 4 tasks with prev_cpu=<big> and 0 tasks with >> prev_cpu=<little>. 
Assuming that those wakeups happen on CPU4, >> regardless of wake_affine, want_affine means that we'll only look in >> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the >> bigs until the next load balance, something like this:
>>
>>  v CPU v        ->time->
>>
>>              ------------------------
>> 0 (big)     11111  /333  31313\33333
>>              ------------------------
>> 1 (big)      22222  /444|424\4444444
>>              ------------------------
>> 2 (LITTLE) 333333/        \222222
>>              ------------------------
>> 3 (LITTLE) 444444/           \1111
>>              ------------------------
>>                     ^^^
>>              underutilization
>>
>> > I _think_ >> > the answer is to make select_idle_sibling() try less hard to find something >> > workable and only use obviously idle cpu's in the affine sd, and fall back to >> > the full load balance esque search. >> >> So this idea of allowing select_idle_sibling to fail, and falling back >> to the slow path, would help me too, I think. > > Unfortunately this statement of mine was wrong, I had it in my head that we > would fall back to a find the idlest cpu thing provided we failed to wake > affine, but we just do select_idle_sibling() and expect the load balancer to > move things around as needed. Ah yes, when wake_affine() returns false, we still do select_idle_sibling (except in prev_cpu's sd_llc instead of smp_processor_id()'s), and that is the problem faced by my workload.
I'm still messing with my patches and I may drop this one as it now > appears to be too aggressive with the new set of patches. Thanks, Sure, I'll take a look at those, thanks. I guess the idea of caching values in LB and then using them in wakeup[2] is a lighter-handed way of achieving the same thing as last_balance_ts? It won't solve my problem directly since we'll still only look in sd_llc, but I think it could be a basis for a way to say "go find_idlest_group path on these tasks" at the beginning of select_task_rq_fair. [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef Thanks, Brendan ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-08-03 15:05 ` Brendan Jackman @ 2017-08-09 21:22 ` Atish Patra 2017-08-10 9:48 ` Brendan Jackman 0 siblings, 1 reply; 29+ messages in thread From: Atish Patra @ 2017-08-09 21:22 UTC (permalink / raw) To: Brendan Jackman, Josef Bacik Cc: Mike Galbraith, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Chris Redpath On 08/03/2017 10:05 AM, Brendan Jackman wrote: > > On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote: >> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote: >>> >>> Hi, >>> >>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote: >>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: >>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>>>> >>>>>>> That makes sense that we multiply slave's flips by a factor because >>>>>>> its low, but I still didn't get why the factor is chosen to be >>>>>>> llc_size instead of something else for the multiplication with slave >>>>>>> (slave * factor). >>>>> >>>>>> Yeah I don't know why llc_size was chosen... >>>>> >>>>> static void update_top_cache_domain(int cpu) >>>>> { >>>>> struct sched_domain_shared *sds = NULL; >>>>> struct sched_domain *sd; >>>>> int id = cpu; >>>>> int size = 1; >>>>> >>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>>>> if (sd) { >>>>> id = cpumask_first(sched_domain_span(sd)); >>>>> size = cpumask_weight(sched_domain_span(sd)); >>>>> sds = sd->shared; >>>>> } >>>>> >>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>>>> per_cpu(sd_llc_size, cpu) = size; >>>>> >>>>> The goal of wake wide was to approximate when pulling would be a futile >>>>> consolidation effort and counterproductive to scaling. 
'course with >>>>> ever increasing socket size, any 1:N waker is ever more likely to run >>>>> out of CPU for its one and only self (slamming into scaling wall) >>>>> before it needing to turn its minions loose to conquer the world. >>>>> >>>>> Something else to consider: network interrupt waking multiple workers >>>>> at high frequency. If the waking CPU is idle, do you really want to >>>>> place a worker directly in front of a tattoo artist, or is it better >>>>> off nearly anywhere but there? >>>>> >>>>> If the box is virtual, with no topology exposed (or real but ancient) >>>>> to let select_idle_sibling() come to the rescue, two workers can even >>>>> get tattooed simultaneously (see sync wakeup). >>>>> >>>> >>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too >>>> big of a hammer, we probably need a middle ground. I'm messing with it right >>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we >>>> see are not because we overload the cpu we're trying to pull to, but because >>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd >>>> instead of doing the full "find the idlest cpu in the land!" thing. >>> >>> This is the problem I've been hitting lately. My use case is 1 task per >>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1 >>> task per CPU, they all do X amount of work then pthread_barrier_wait >>> (i.e. sleep until the last task finishes its X and hits the barrier). 
On >>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then >>> those CPUs pull over the tasks that are still running: >>> >>> v CPU v ->time-> >>> >>> ------------- >>> 0 (big) 11111 /333 >>> ------------- >>> 1 (big) 22222 /444| >>> ------------- >>> 2 (LITTLE) 333333/ >>> ------------- >>> 3 (LITTLE) 444444/ >>> ------------- >>> >>> Now when task 4 hits the barrier (at |) and wakes the others up, there >>> are 4 tasks with prev_cpu=<big> and 0 tasks with >>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4, >>> regardless of wake_affine, want_affine means that we'll only look in >>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the >>> bigs until the next load balance, something like this: >>> >>> v CPU v ->time-> >>> >>> ------------------------ >>> 0 (big) 11111 /333 31313\33333 >>> ------------------------ >>> 1 (big) 22222 /444|424\4444444 >>> ------------------------ >>> 2 (LITTLE) 333333/ \222222 >>> ------------------------ >>> 3 (LITTLE) 444444/ \1111 >>> ------------------------ >>> ^^^ >>> underutilization >>> >>>> I _think_ >>>> the answer is to make select_idle_sibling() try less hard to find something >>>> workable and only use obviously idle cpu's in the affine sd, and fall back to >>>> the full load balance esque search. >>> >>> So this idea of allowing select_idle_sibling to fail, and falling back >>> to the slow path, would help me too, I think. >> >> Unfortunately this statement of mine was wrong, I had it in my head that we >> would fall back to a find the idlest cpu thing provided we failed to wake >> affine, but we just do select_idle_sibling() and expect the load balancer to >> move things around as needed. > > Ah yes, when wake_affine() returns false, we still do > select_idle_sibling (except in prev_cpu's sd_llc instead of > smp_processor_id()'s), and that is the problem faced by my workload. 
I > thought you were suggesting to change the flow so that > select_idle_sibling can say "I didn't find any idle siblings - go to the > find_idlest_group path". > >>> This is also why I was playing with your >>> don't-affine-recently-balanced-tasks patch[1], which also helps my case >>> since it prevents want_affine for tasks 3 and 4 (which were recently >>> moved by an active balance). >>> >>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2 >>> (also linked elsewhere in this thread) >>> >> >> Would you try peter's sched/experimental branch and see how that affects your >> workload? I'm still messing with my patches and I may drop this one as it now >> appears to be too aggressive with the new set of patches. Thanks, > > Sure, I'll take a look at those, thanks. I guess the idea of caching > values in LB and then using them in wakeup[2] is a lighter-handed way of > achieving the same thing as last_balance_ts? It won't solve my problem > directly since we'll still only look in sd_llc, but I think it could be > a basis for a way to say "go find_idlest_group path on these tasks" at > the beginning of select_task_rq_fair. > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef > > Thanks, > Brendan > Would it be better if it searches for idle cpus in the next higher domain (if that is not NUMA) instead of doing find_idlest_group? The above approach [2] will still try to find an idle cpu in the llc domain of new_cpu. If it finds one, great. Otherwise, let's search for an idle cpu in the next higher domain (if that is not NUMA), excluding the current domain which we have already searched. I think it would help in cases where LLC < NUMA. Regards, Atish ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-08-09 21:22 ` Atish Patra @ 2017-08-10 9:48 ` Brendan Jackman 2017-08-10 17:41 ` Atish Patra 0 siblings, 1 reply; 29+ messages in thread From: Brendan Jackman @ 2017-08-10 9:48 UTC (permalink / raw) To: atish.patra Cc: Josef Bacik, Mike Galbraith, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Chris Redpath On Wed, Aug 09 2017 at 21:22, Atish Patra wrote: > On 08/03/2017 10:05 AM, Brendan Jackman wrote: >> >> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote: >>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote: >>>> >>>> Hi, >>>> >>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote: >>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: >>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>>>>> >>>>>>>> That makes sense that we multiply slave's flips by a factor because >>>>>>>> its low, but I still didn't get why the factor is chosen to be >>>>>>>> llc_size instead of something else for the multiplication with slave >>>>>>>> (slave * factor). >>>>>> >>>>>>> Yeah I don't know why llc_size was chosen... >>>>>> >>>>>> static void update_top_cache_domain(int cpu) >>>>>> { >>>>>> struct sched_domain_shared *sds = NULL; >>>>>> struct sched_domain *sd; >>>>>> int id = cpu; >>>>>> int size = 1; >>>>>> >>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>>>>> if (sd) { >>>>>> id = cpumask_first(sched_domain_span(sd)); >>>>>> size = cpumask_weight(sched_domain_span(sd)); >>>>>> sds = sd->shared; >>>>>> } >>>>>> >>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>>>>> per_cpu(sd_llc_size, cpu) = size; >>>>>> >>>>>> The goal of wake wide was to approximate when pulling would be a futile >>>>>> consolidation effort and counterproductive to scaling. 
'course with >>>>>> ever increasing socket size, any 1:N waker is ever more likely to run >>>>>> out of CPU for its one and only self (slamming into scaling wall) >>>>>> before it needing to turn its minions loose to conquer the world. >>>>>> >>>>>> Something else to consider: network interrupt waking multiple workers >>>>>> at high frequency. If the waking CPU is idle, do you really want to >>>>>> place a worker directly in front of a tattoo artist, or is it better >>>>>> off nearly anywhere but there? >>>>>> >>>>>> If the box is virtual, with no topology exposed (or real but ancient) >>>>>> to let select_idle_sibling() come to the rescue, two workers can even >>>>>> get tattooed simultaneously (see sync wakeup). >>>>>> >>>>> >>>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too >>>>> big of a hammer, we probably need a middle ground. I'm messing with it right >>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we >>>>> see are not because we overload the cpu we're trying to pull to, but because >>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd >>>>> instead of doing the full "find the idlest cpu in the land!" thing. >>>> >>>> This is the problem I've been hitting lately. My use case is 1 task per >>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1 >>>> task per CPU, they all do X amount of work then pthread_barrier_wait >>>> (i.e. sleep until the last task finishes its X and hits the barrier). 
On >>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then >>>> those CPUs pull over the tasks that are still running: >>>> >>>> v CPU v ->time-> >>>> >>>> ------------- >>>> 0 (big) 11111 /333 >>>> ------------- >>>> 1 (big) 22222 /444| >>>> ------------- >>>> 2 (LITTLE) 333333/ >>>> ------------- >>>> 3 (LITTLE) 444444/ >>>> ------------- >>>> >>>> Now when task 4 hits the barrier (at |) and wakes the others up, there >>>> are 4 tasks with prev_cpu=<big> and 0 tasks with >>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4, >>>> regardless of wake_affine, want_affine means that we'll only look in >>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the >>>> bigs until the next load balance, something like this: >>>> >>>> v CPU v ->time-> >>>> >>>> ------------------------ >>>> 0 (big) 11111 /333 31313\33333 >>>> ------------------------ >>>> 1 (big) 22222 /444|424\4444444 >>>> ------------------------ >>>> 2 (LITTLE) 333333/ \222222 >>>> ------------------------ >>>> 3 (LITTLE) 444444/ \1111 >>>> ------------------------ >>>> ^^^ >>>> underutilization >>>> >>>>> I _think_ >>>>> the answer is to make select_idle_sibling() try less hard to find something >>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to >>>>> the full load balance esque search. >>>> >>>> So this idea of allowing select_idle_sibling to fail, and falling back >>>> to the slow path, would help me too, I think. >>> >>> Unfortunately this statement of mine was wrong, I had it in my head that we >>> would fall back to a find the idlest cpu thing provided we failed to wake >>> affine, but we just do select_idle_sibling() and expect the load balancer to >>> move things around as needed. >> >> Ah yes, when wake_affine() returns false, we still do >> select_idle_sibling (except in prev_cpu's sd_llc instead of >> smp_processor_id()'s), and that is the problem faced by my workload. 
I >> thought you were suggesting to change the flow so that >> select_idle_sibling can say "I didn't find any idle siblings - go to the >> find_idlest_group path". >> >>>> This is also why I was playing with your >>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case >>>> since it prevents want_affine for tasks 3 and 4 (which were recently >>>> moved by an active balance). >>>> >>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2 >>>> (also linked elsewhere in this thread) >>>> >>> >>> Would you try peter's sched/experimental branch and see how that affects your >>> workload? I'm still messing with my patches and I may drop this one as it now >>> appears to be too aggressive with the new set of patches. Thanks, >> >> Sure, I'll take a look at those, thanks. I guess the idea of caching >> values in LB and then using them in wakeup[2] is a lighter-handed way of >> achieving the same thing as last_balance_ts? It won't solve my problem >> directly since we'll still only look in sd_llc, but I think it could be >> a basis for a way to say "go find_idlest_group path on these tasks" at >> the beginning of select_task_rq_fair. >> >> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef >> >> Thanks, >> Brendan >> > > Would it be better if it searches for idle cpus in next higher domain > (if that is not NUMA) instead of doing find_idlest_group ? > > The above approach [2] will still try to search a idle cpu in the llc > domain of new_cpu. If it finds one great. Otherwise, let's search for an > idle cpu in next higher domain(if that is not NUMA) excluding the > current domain which we have already searched. I think it would help in > cases where LLC < NUMA. Well for my use case, "search for idle cpus in the next higher domain" and "go to find_busiest_group & co" mean the same thing. But for NUMA boxes (where I'm guessing there are >=3 sched_domains?) 
then yeah that sounds like a good idea at least to my inexpert ears. Currently it's either "only look at siblings of either prev_cpu or smp_processor_id()" or "look through the whole system (or at least as far as sd_flag takes us)", so maybe you want middle ground? On the other hand, I'm guessing remote wakeups across NUMA nodes are expensive, so that's why we only look at siblings? There's already a choice between prev_cpu's and smp_processor_id()'s LLC siblings. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-08-10 9:48 ` Brendan Jackman @ 2017-08-10 17:41 ` Atish Patra 0 siblings, 0 replies; 29+ messages in thread From: Atish Patra @ 2017-08-10 17:41 UTC (permalink / raw) To: Brendan Jackman Cc: Josef Bacik, Mike Galbraith, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Chris Redpath On 08/10/2017 04:48 AM, Brendan Jackman wrote: > > On Wed, Aug 09 2017 at 21:22, Atish Patra wrote: >> On 08/03/2017 10:05 AM, Brendan Jackman wrote: >>> >>> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote: >>>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote: >>>>> >>>>> Hi, >>>>> >>>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote: >>>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote: >>>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>>>>>> >>>>>>>>> That makes sense that we multiply slave's flips by a factor because >>>>>>>>> its low, but I still didn't get why the factor is chosen to be >>>>>>>>> llc_size instead of something else for the multiplication with slave >>>>>>>>> (slave * factor). >>>>>>> >>>>>>>> Yeah I don't know why llc_size was chosen... >>>>>>> >>>>>>> static void update_top_cache_domain(int cpu) >>>>>>> { >>>>>>> struct sched_domain_shared *sds = NULL; >>>>>>> struct sched_domain *sd; >>>>>>> int id = cpu; >>>>>>> int size = 1; >>>>>>> >>>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>>>>>> if (sd) { >>>>>>> id = cpumask_first(sched_domain_span(sd)); >>>>>>> size = cpumask_weight(sched_domain_span(sd)); >>>>>>> sds = sd->shared; >>>>>>> } >>>>>>> >>>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>>>>>> per_cpu(sd_llc_size, cpu) = size; >>>>>>> >>>>>>> The goal of wake wide was to approximate when pulling would be a futile >>>>>>> consolidation effort and counterproductive to scaling. 
'course with >>>>>>> ever increasing socket size, any 1:N waker is ever more likely to run >>>>>>> out of CPU for its one and only self (slamming into scaling wall) >>>>>>> before it needing to turn its minions loose to conquer the world. >>>>>>> >>>>>>> Something else to consider: network interrupt waking multiple workers >>>>>>> at high frequency. If the waking CPU is idle, do you really want to >>>>>>> place a worker directly in front of a tattoo artist, or is it better >>>>>>> off nearly anywhere but there? >>>>>>> >>>>>>> If the box is virtual, with no topology exposed (or real but ancient) >>>>>>> to let select_idle_sibling() come to the rescue, two workers can even >>>>>>> get tattooed simultaneously (see sync wakeup). >>>>>>> >>>>>> >>>>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too >>>>>> big of a hammer, we probably need a middle ground. I'm messing with it right >>>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we >>>>>> see are not because we overload the cpu we're trying to pull to, but because >>>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd >>>>>> instead of doing the full "find the idlest cpu in the land!" thing. >>>>> >>>>> This is the problem I've been hitting lately. My use case is 1 task per >>>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1 >>>>> task per CPU, they all do X amount of work then pthread_barrier_wait >>>>> (i.e. sleep until the last task finishes its X and hits the barrier). 
On >>>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then >>>>> those CPUs pull over the tasks that are still running: >>>>> >>>>> v CPU v ->time-> >>>>> >>>>> ------------- >>>>> 0 (big) 11111 /333 >>>>> ------------- >>>>> 1 (big) 22222 /444| >>>>> ------------- >>>>> 2 (LITTLE) 333333/ >>>>> ------------- >>>>> 3 (LITTLE) 444444/ >>>>> ------------- >>>>> >>>>> Now when task 4 hits the barrier (at |) and wakes the others up, there >>>>> are 4 tasks with prev_cpu=<big> and 0 tasks with >>>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4, >>>>> regardless of wake_affine, want_affine means that we'll only look in >>>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the >>>>> bigs until the next load balance, something like this: >>>>> >>>>> v CPU v ->time-> >>>>> >>>>> ------------------------ >>>>> 0 (big) 11111 /333 31313\33333 >>>>> ------------------------ >>>>> 1 (big) 22222 /444|424\4444444 >>>>> ------------------------ >>>>> 2 (LITTLE) 333333/ \222222 >>>>> ------------------------ >>>>> 3 (LITTLE) 444444/ \1111 >>>>> ------------------------ >>>>> ^^^ >>>>> underutilization >>>>> >>>>>> I _think_ >>>>>> the answer is to make select_idle_sibling() try less hard to find something >>>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to >>>>>> the full load balance esque search. >>>>> >>>>> So this idea of allowing select_idle_sibling to fail, and falling back >>>>> to the slow path, would help me too, I think. >>>> >>>> Unfortunately this statement of mine was wrong, I had it in my head that we >>>> would fall back to a find the idlest cpu thing provided we failed to wake >>>> affine, but we just do select_idle_sibling() and expect the load balancer to >>>> move things around as needed. 
>>> >>> Ah yes, when wake_affine() returns false, we still do >>> select_idle_sibling (except in prev_cpu's sd_llc instead of >>> smp_processor_id()'s), and that is the problem faced by my workload. I >>> thought you were suggesting to change the flow so that >>> select_idle_sibling can say "I didn't find any idle siblings - go to the >>> find_idlest_group path". >>> >>>>> This is also why I was playing with your >>>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case >>>>> since it prevents want_affine for tasks 3 and 4 (which were recently >>>>> moved by an active balance). >>>>> >>>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2 >>>>> (also linked elsewhere in this thread) >>>>> >>>> >>>> Would you try peter's sched/experimental branch and see how that affects your >>>> workload? I'm still messing with my patches and I may drop this one as it now >>>> appears to be too aggressive with the new set of patches. Thanks, >>> >>> Sure, I'll take a look at those, thanks. I guess the idea of caching >>> values in LB and then using them in wakeup[2] is a lighter-handed way of >>> achieving the same thing as last_balance_ts? It won't solve my problem >>> directly since we'll still only look in sd_llc, but I think it could be >>> a basis for a way to say "go find_idlest_group path on these tasks" at >>> the beginning of select_task_rq_fair. >>> >>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef >>> >>> Thanks, >>> Brendan >>> >> >> Would it be better if it searches for idle cpus in next higher domain >> (if that is not NUMA) instead of doing find_idlest_group ? >> >> The above approach [2] will still try to search a idle cpu in the llc >> domain of new_cpu. If it finds one great. Otherwise, let's search for an >> idle cpu in next higher domain(if that is not NUMA) excluding the >> current domain which we have already searched. 
I think it would help in >> cases where LLC < NUMA. > > Well for my use case, "search for idle cpus in the next higher domain" > and "go to find_busiest_group & co" mean the same thing. Yeah. It just may be a tad cheaper to do an idle cpu search than to do find_idlest_group and find_idlest_cpu. But for NUMA > boxes (where I'm guessing there are >=3 sched_domains?) Yeah. then yeah that > sounds like a good idea at least to my inexpert ears. Currently it's > either "only look at siblings of either prev_cpu or smp_processor_id()" > or "look through the whole system (or at least as far as sd_flag takes > us)", so maybe you want middle ground? > > On the other hand, I'm guessing remote wakeups across NUMA nodes are > expensive, so that's why we only look at siblings? Correct. But I am suggesting that the above modification impact only archs where LLC < NUMA, i.e. you have SMT, LLC, DIE/CPU, NUMA.. domains. For your case, I guess the domains would stop at CPU. Any architecture with the above domain hierarchy can search wide to find an idle cpu. For other architectures, where NUMA is the next domain above LLC, it won't do anything. There's already a > choice between prev_cpu's and smp_processor_id()'s LLC siblings. > Yes. So we may benefit by searching for idle cpus in other LLC groups as well. Initially, I was worried that it may be too expensive to search wide, but Peter's recent patch (1ad3aaf3fcd2: Implement new approach to scale select_idle_cpu) should abandon the search if it becomes too costly. Did I miss any case where it may affect performance negatively? Regards, Atish ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 17:02 ` Mike Galbraith 2017-06-30 17:55 ` Josef Bacik @ 2017-07-29 8:01 ` Joel Fernandes 2017-07-29 8:13 ` Joel Fernandes 2017-07-29 15:07 ` Mike Galbraith 1 sibling, 2 replies; 29+ messages in thread From: Joel Fernandes @ 2017-07-29 8:01 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming Hi Mike, I have spent some time understanding the email thread and previous discussions. Unfortunately the second condition we are checking for in wake_wide still didn't make sense to me (mentioned below) :-( On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote: > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >> >> > That makes sense that we multiply slave's flips by a factor because >> > its low, but I still didn't get why the factor is chosen to be >> > llc_size instead of something else for the multiplication with slave >> > (slave * factor). > >> Yeah I don't know why llc_size was chosen... > > static void update_top_cache_domain(int cpu) > { > struct sched_domain_shared *sds = NULL; > struct sched_domain *sd; > int id = cpu; > int size = 1; > > sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); > if (sd) { > id = cpumask_first(sched_domain_span(sd)); > size = cpumask_weight(sched_domain_span(sd)); > sds = sd->shared; > } > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); > per_cpu(sd_llc_size, cpu) = size; > > The goal of wake wide was to approximate when pulling would be a futile > consolidation effort and counterproductive to scaling. 'course with > ever increasing socket size, any 1:N waker is ever more likely to run > out of CPU for its one and only self (slamming into scaling wall) > before it needing to turn its minions loose to conquer the world.
Actually the original question was why do we have the second condition as "master < slave * factor", instead of "master < factor". That's what didn't make sense to me. Why don't we return 0 from wake_wide if master < factor? In fact, as the factor is set to the llc_size, I think the condition that makes sense to me is:

if ((master + slave) < llc_size)
        return 0;

In other words, if the master flips and the slave flips are in total higher than the llc_size, then we are most likely waking up too many tasks as affine and should then switch to wide to prevent overloading. Digging further into the original patch from Michael Wang (I also CC'd him), this was the code (before you had changed it to master/slave):

wakee->nr_wakee_switch > factor &&
waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

To explain the second condition above, Michael Wang said the following in [1]: "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple tasks rely on it, then waker's higher latency will damage all of them, pull wakee seems to be a bad deal." Again I didn't follow why the second condition couldn't just be: waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + wakee->nr_wakee_switch) > factor, based on the above explanation from Michael Wang that I quoted, and why he's instead doing the whole multiplication thing there that I was talking about earlier: "factor * wakee->nr_wakee_switch". Rephrasing my question in another way, why are we taking the ratio of master/slave instead of the sum when comparing if it's > factor? I am surely missing something here. Just taking an example: Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8 slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common between all 3 masters. Also to make it a bit more interesting, let s8 wake up some random task T0.
A diagram to show the master/slave relationships could look like (simplified):

  M1 --> s1 s2 s3 s4 s5 s6 s7 s8    (s8 ---> T0)
  M2 --> s9 s10 s11 s15
  M3 --> s12 s13 s14 s16

Let's consider the case of M1 waking up s8. As per the above diagram, M1 has 8 flips and s8 has 4 flips. With llc_size = 3, the condition (slave < factor) would return FALSE, so then we would turn to the (master < slave * factor) condition. This would be TRUE (8 < 4 * 3), so wake_wide would return 0 and would cause s8 to be woken up as affine with relation to M1's core. So basically, it seems the heuristic is saying (with the help of the second condition, master < slave * factor) that it's a good idea for s8 to be affine-woken-up with respect to M1's core. Why is it a good idea to do that? It seems to me M1 already has several tasks it's waking as affine, so causing s8 to be woken up affine could be harmful and it may be a better choice to wake it up elsewhere. Thanks for your help! -Joel [1] https://lkml.org/lkml/2013/7/4/20 > > Something else to consider: network interrupt waking multiple workers > at high frequency. If the waking CPU is idle, do you really want to > place a worker directly in front of a tattoo artist, or is it better > off nearly anywhere but there? > > If the box is virtual, with no topology exposed (or real but ancient) > to let select_idle_sibling() come to the rescue, two workers can even > get tattooed simultaneously (see sync wakeup). > > -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-29 8:01 ` Joel Fernandes @ 2017-07-29 8:13 ` Joel Fernandes 2017-08-02 8:26 ` Michael Wang 2017-07-29 15:07 ` Mike Galbraith 1 sibling, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-07-29 8:13 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Matt Fleming, Michael Wang +Michael Wang on his current email address (old one bounced). (my reply was to Mike Galbraith but I also meant to CC Michael Wang for the discussion). Thanks On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@google.com> wrote: > Hi Mike, > > I have take spent some time understanding the email thread and > previous discussions. Unfortunately the second condition we are > checking for in the wake_wide still didn't make sense to me (mentioned > below) :-( > > On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith > <umgwanakikbuti@gmail.com> wrote: >> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>> >>> > That makes sense that we multiply slave's flips by a factor because >>> > its low, but I still didn't get why the factor is chosen to be >>> > llc_size instead of something else for the multiplication with slave >>> > (slave * factor). >> >>> Yeah I don't know why llc_size was chosen... >> >> static void update_top_cache_domain(int cpu) >> { >> struct sched_domain_shared *sds = NULL; >> struct sched_domain *sd; >> int id = cpu; >> int size = 1; >> >> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >> if (sd) { >> id = cpumask_first(sched_domain_span(sd)); >> size = cpumask_weight(sched_domain_span(sd)); >> sds = sd->shared; >> } >> >> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >> per_cpu(sd_llc_size, cpu) = size; >> >> The goal of wake wide was to approximate when pulling would be a futile >> consolidation effort and counterproductive to scaling. 
'course with >> ever increasing socket size, any 1:N waker is ever more likely to run >> out of CPU for its one and only self (slamming into scaling wall) >> before it needing to turn its minions loose to conquer the world. > > Actually the original question was why do we have the second condition > as "master < slave * factor", instead of "master < factor". that's > what didn't make sense to me. Why don't we return 0 from wake_wide if > master < factor ? > > Infact, as the factor is set to the llc_size, I think the condition > that makes sense to me is: > > if ((master + slave) < llc_size) > return 0; > > In other words, if the master flips and the slave flips are totally > higher than the llc_size, then we are most likely waking up too many > tasks as affine and should then switch to wide to prevent overloading. > > Digging further into the original patch from Michael Wang (I also CC'd > him), this was the code (before you had changed it to master/slave): > > wakee->nr_wakee_switch > factor && > waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch) > > To explain the second condition above, Michael Wang said the following in [1] > > "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple > tasks rely on it, then waker's higher latency will damage all of them, pull > wakee seems to be a bad deal." > > Again I didn't follow why the second condition couldn't just be: > waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + > wakee->nr_wakee_switch) > factor, based on the above explanation from > Micheal Wang that I quoted. > and why he's instead doing the whole multiplication thing there that I > was talking about earlier: "factor * wakee->nr_wakee_switch". > > Rephrasing my question in another way, why are we talking the ratio of > master/slave instead of the sum when comparing if its > factor? I am > surely missing something here. > > Just taking an example: > > Say we have llc_size = 3, we have 3 masters M1, M2 and M3. 
M1 has 8 > slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common > between all 3 masters. Also to make it a bit more interesting, let s8 > wake up some random task T0. A diagram to show the master/slave > relation ships could look like: > > +-----+ > | | > +------------------------+ +------+ M2 | > | | | | | > | M1 | | +--+--+---- > | | | | | | | > | | | | | | s15 > +--+--+--+--+--+--+---+--+---+ v v v > | | | | | | | | s9 s10 s11 > v v v v v v v v > s1 s2 s3 s4 s5 s6 s7 s8 ---> T0 > ^ > | > +-+---+ > | | > | M3 | > | | > +--+--+----- > | | | | > v v v v > s12 s13 s14 s16 > > > Lets consider the case of M1 waking up s8. As per the above diagram, > M1 has 8 flips and s8 has 4 flips. > > With llc_size = 3, the condition > > (slave < factor) would return FALSE, so then we would turn to the > (master < slave * factor) condition. This would be TRUE (8 < 4 * 3), > so wake_wide would return 0 and would cause s8 to be woken up as > affine with relation to M1's core. > > So basically, it seems the heuristic is saying (with help of the > second condition - master < slave * factor). that Its a good idea for > s8 to be affine-woken-up with respect to M1's core. Why is it a good > idea to do that? It seems to me M1 has already several tasks its > waking as affine so causing s8 to be woken up affine could be harmful > and it may be a better choice to wake it up elsewhere. > > Thanks for your help! > > -Joel > > [1] https://lkml.org/lkml/2013/7/4/20 > > >> >> Something else to consider: network interrupt waking multiple workers >> at high frequency. If the waking CPU is idle, do you really want to >> place a worker directly in front of a tattoo artist, or is it better >> off nearly anywhere but there? >> >> If the box is virtual, with no topology exposed (or real but ancient) >> to let select_idle_sibling() come to the rescue, two workers can even >> get tattooed simultaneously (see sync wakeup). >> >> -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
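The update_top_cache_domain() snippet quoted above shows where the factor comes from: sd_llc_size is simply the CPU count (cpumask_weight) of the span of the highest sched domain with SD_SHARE_PKG_RESOURCES set. A minimal sketch, where the domain span is modeled as a plain bitmask and llc_size() is a hypothetical stand-in for cpumask_weight(sched_domain_span(sd)):

```c
#include <stdio.h>

/*
 * Sketch of how the "factor" (sd_llc_size) is derived: count the
 * CPUs in the span of the highest cache-sharing domain.  A zero
 * span models the case where no SD_SHARE_PKG_RESOURCES domain is
 * found, which leaves the default size of 1.
 */
static int llc_size(unsigned long llc_span)
{
	int size = 1;	/* default when no shared-cache domain exists */

	if (llc_span)
		size = __builtin_popcountl(llc_span);
	return size;
}
```

For example, four CPUs sharing an L3 give a span of 0xf and a factor of 4, while a topology exposing no shared cache at all (Mike's "box is virtual, with no topology exposed" case) degrades the factor to 1.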
* Re: wake_wide mechanism clarification 2017-07-29 8:13 ` Joel Fernandes @ 2017-08-02 8:26 ` Michael Wang 2017-08-03 23:48 ` Joel Fernandes 0 siblings, 1 reply; 29+ messages in thread From: Michael Wang @ 2017-08-02 8:26 UTC (permalink / raw) To: Joel Fernandes, Mike Galbraith Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Matt Fleming

Hi, Joel

On 07/29/2017 10:13 AM, Joel Fernandes wrote:
> +Michael Wang on his current email address (old one bounced). (my
> reply was to Mike Galbraith but I also meant to CC Michael Wang for
> the discussion). Thanks

Just back from vacation and saw this long long discussion...

I think the guys explained the idea well: wake_wide() just tries to filter out the cases that may congest the waker's domain, as much as possible and ideally without side effects.

The factor at the very beginning was just a static number, picked through many rounds of practical testing; Peter Zijlstra suggested we use the domain size to make it a flexible factor, which in practice worked out not that badly.

So the simple answer is that we use llc_size since there was no better option at the time :-)

But things change very fast and new features can introduce new cases. I'm pretty sure that if we redid the testing the results would be very different; however, the idea itself still makes sense to me, at least in theory.

Recently I've also been thinking about the scheduler issue: CFS tries to find a general solution for all these cases, and the result is obvious, all the cases suffer some damage and the scheduler itself gets bloated to achieve the goal of 'taking care of all'. So in order to achieve maximum performance for a particular workload, some user-defined scheduler would be an interesting idea :-P

Regards,
Michael Wang

>
> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@google.com> wrote:
>> Hi Mike,
>>
>> I have take spent some time understanding the email thread and
>> previous discussions.
Unfortunately the second condition we are >> checking for in the wake_wide still didn't make sense to me (mentioned >> below) :-( >> >> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith >> <umgwanakikbuti@gmail.com> wrote: >>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>> >>>>> That makes sense that we multiply slave's flips by a factor because >>>>> its low, but I still didn't get why the factor is chosen to be >>>>> llc_size instead of something else for the multiplication with slave >>>>> (slave * factor). >>> >>>> Yeah I don't know why llc_size was chosen... >>> >>> static void update_top_cache_domain(int cpu) >>> { >>> struct sched_domain_shared *sds = NULL; >>> struct sched_domain *sd; >>> int id = cpu; >>> int size = 1; >>> >>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>> if (sd) { >>> id = cpumask_first(sched_domain_span(sd)); >>> size = cpumask_weight(sched_domain_span(sd)); >>> sds = sd->shared; >>> } >>> >>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>> per_cpu(sd_llc_size, cpu) = size; >>> >>> The goal of wake wide was to approximate when pulling would be a futile >>> consolidation effort and counterproductive to scaling. 'course with >>> ever increasing socket size, any 1:N waker is ever more likely to run >>> out of CPU for its one and only self (slamming into scaling wall) >>> before it needing to turn its minions loose to conquer the world. >> >> Actually the original question was why do we have the second condition >> as "master < slave * factor", instead of "master < factor". that's >> what didn't make sense to me. Why don't we return 0 from wake_wide if >> master < factor ? 
>> >> Infact, as the factor is set to the llc_size, I think the condition >> that makes sense to me is: >> >> if ((master + slave) < llc_size) >> return 0; >> >> In other words, if the master flips and the slave flips are totally >> higher than the llc_size, then we are most likely waking up too many >> tasks as affine and should then switch to wide to prevent overloading. >> >> Digging further into the original patch from Michael Wang (I also CC'd >> him), this was the code (before you had changed it to master/slave): >> >> wakee->nr_wakee_switch > factor && >> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch) >> >> To explain the second condition above, Michael Wang said the following in [1] >> >> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple >> tasks rely on it, then waker's higher latency will damage all of them, pull >> wakee seems to be a bad deal." >> >> Again I didn't follow why the second condition couldn't just be: >> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + >> wakee->nr_wakee_switch) > factor, based on the above explanation from >> Micheal Wang that I quoted. >> and why he's instead doing the whole multiplication thing there that I >> was talking about earlier: "factor * wakee->nr_wakee_switch". >> >> Rephrasing my question in another way, why are we talking the ratio of >> master/slave instead of the sum when comparing if its > factor? I am >> surely missing something here. >> >> Just taking an example: >> >> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8 >> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common >> between all 3 masters. Also to make it a bit more interesting, let s8 >> wake up some random task T0. 
A diagram to show the master/slave >> relation ships could look like: >> >> +-----+ >> | | >> +------------------------+ +------+ M2 | >> | | | | | >> | M1 | | +--+--+---- >> | | | | | | | >> | | | | | | s15 >> +--+--+--+--+--+--+---+--+---+ v v v >> | | | | | | | | s9 s10 s11 >> v v v v v v v v >> s1 s2 s3 s4 s5 s6 s7 s8 ---> T0 >> ^ >> | >> +-+---+ >> | | >> | M3 | >> | | >> +--+--+----- >> | | | | >> v v v v >> s12 s13 s14 s16 >> >> >> Lets consider the case of M1 waking up s8. As per the above diagram, >> M1 has 8 flips and s8 has 4 flips. >> >> With llc_size = 3, the condition >> >> (slave < factor) would return FALSE, so then we would turn to the >> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3), >> so wake_wide would return 0 and would cause s8 to be woken up as >> affine with relation to M1's core. >> >> So basically, it seems the heuristic is saying (with help of the >> second condition - master < slave * factor). that Its a good idea for >> s8 to be affine-woken-up with respect to M1's core. Why is it a good >> idea to do that? It seems to me M1 has already several tasks its >> waking as affine so causing s8 to be woken up affine could be harmful >> and it may be a better choice to wake it up elsewhere. >> >> Thanks for your help! >> >> -Joel >> >> [1] https://lkml.org/lkml/2013/7/4/20 >> >> >>> >>> Something else to consider: network interrupt waking multiple workers >>> at high frequency. If the waking CPU is idle, do you really want to >>> place a worker directly in front of a tattoo artist, or is it better >>> off nearly anywhere but there? >>> >>> If the box is virtual, with no topology exposed (or real but ancient) >>> to let select_idle_sibling() come to the rescue, two workers can even >>> get tattooed simultaneously (see sync wakeup). >>> >>> -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
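Michael Wang's original condition, quoted earlier in this message, ran in one direction only: the wakee had to flip often, and the waker a factor more often still. The later master/slave form first swaps so that the busier side is always the "master", making the test symmetric. A sketch contrasting the two (hedged; the v1/v2 names are mine, and these are standalone models of the quoted conditions, not the kernel functions):

```c
#include <stdio.h>

/* Michael Wang's original, one-directional test: return 1 ("don't
 * pull / wake wide") only when the wakee flips often AND the waker
 * flips a factor more often still, i.e. the waker looks 1:N. */
static int wake_wide_v1(unsigned int waker_flips,
                        unsigned int wakee_flips,
                        unsigned int factor)
{
	return wakee_flips > factor &&
	       waker_flips > factor * wakee_flips;
}

/* The later form is symmetric: whichever side flips more is treated
 * as the master, so a slave waking its 1:N master is also caught. */
static int wake_wide_v2(unsigned int a, unsigned int b,
                        unsigned int factor)
{
	unsigned int master = a > b ? a : b;
	unsigned int slave  = a > b ? b : a;

	return !(slave < factor || master < slave * factor);
}
```

With factor 3, a 1:N waker with flips (40, 10) trips both tests, but with the roles reversed (a slave with 10 flips waking its 40-flip master) only the symmetric v2 still detects the 1:N relationship.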
* Re: wake_wide mechanism clarification 2017-08-02 8:26 ` Michael Wang @ 2017-08-03 23:48 ` Joel Fernandes 0 siblings, 0 replies; 29+ messages in thread From: Joel Fernandes @ 2017-08-03 23:48 UTC (permalink / raw) To: Michael Wang Cc: Mike Galbraith, Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Matt Fleming Hi Michael, Thanks for your reply. On Wed, Aug 2, 2017 at 1:26 AM, Michael Wang <yun.wang@profitbricks.com> wrote: > Hi, Joel > > On 07/29/2017 10:13 AM, Joel Fernandes wrote: >> +Michael Wang on his current email address (old one bounced). (my >> reply was to Mike Galbraith but I also meant to CC Michael Wang for >> the discussion). Thanks > > Just back from vacation and saw this long long discussion... > > I think guys explained well on the idea, wake_wide() just try to filter > out the cases that may congest the waker's domain, as much as possible > without side effect (ideally). > > The factor at very beginning is just a static number which picked by > enormous times of practice testing, Peter Zijlstr suggest we add the > domain size and make it a flexible factor, which by practice not that bad. Yeah making it flexible might be better. > > So the simple answer is we use the llc_size since no better option at > that time :-) :-) > > But things changing very fast and new feature can introduce new cases, > I'm pretty sure if we redo the testing the results will be very different, > however, the idea itself still make sense to me, at least on theory. > > Recently I'm also thinking about the scheduler issue, cfs try to find out > general solution for all these cases and the best answer is obviously, all > the cases will suffer some damage and scheduler itself bloated to achieve > the goal 'taking care of all'. My team is definitely seeing some weirdness with wake_wide() itself not working correctly. We're collecting more data and do a more thorough analysis of the behavior. 
I think it can be improved. Will update soon as we have something. thanks, -Joel > > Regards, > Michael Wang > > > >> >> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@google.com> wrote: >>> Hi Mike, >>> >>> I have take spent some time understanding the email thread and >>> previous discussions. Unfortunately the second condition we are >>> checking for in the wake_wide still didn't make sense to me (mentioned >>> below) :-( >>> >>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith >>> <umgwanakikbuti@gmail.com> wrote: >>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >>>>> >>>>>> That makes sense that we multiply slave's flips by a factor because >>>>>> its low, but I still didn't get why the factor is chosen to be >>>>>> llc_size instead of something else for the multiplication with slave >>>>>> (slave * factor). >>>> >>>>> Yeah I don't know why llc_size was chosen... >>>> >>>> static void update_top_cache_domain(int cpu) >>>> { >>>> struct sched_domain_shared *sds = NULL; >>>> struct sched_domain *sd; >>>> int id = cpu; >>>> int size = 1; >>>> >>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); >>>> if (sd) { >>>> id = cpumask_first(sched_domain_span(sd)); >>>> size = cpumask_weight(sched_domain_span(sd)); >>>> sds = sd->shared; >>>> } >>>> >>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); >>>> per_cpu(sd_llc_size, cpu) = size; >>>> >>>> The goal of wake wide was to approximate when pulling would be a futile >>>> consolidation effort and counterproductive to scaling. 'course with >>>> ever increasing socket size, any 1:N waker is ever more likely to run >>>> out of CPU for its one and only self (slamming into scaling wall) >>>> before it needing to turn its minions loose to conquer the world. >>> >>> Actually the original question was why do we have the second condition >>> as "master < slave * factor", instead of "master < factor". 
that's >>> what didn't make sense to me. Why don't we return 0 from wake_wide if >>> master < factor ? >>> >>> Infact, as the factor is set to the llc_size, I think the condition >>> that makes sense to me is: >>> >>> if ((master + slave) < llc_size) >>> return 0; >>> >>> In other words, if the master flips and the slave flips are totally >>> higher than the llc_size, then we are most likely waking up too many >>> tasks as affine and should then switch to wide to prevent overloading. >>> >>> Digging further into the original patch from Michael Wang (I also CC'd >>> him), this was the code (before you had changed it to master/slave): >>> >>> wakee->nr_wakee_switch > factor && >>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch) >>> >>> To explain the second condition above, Michael Wang said the following in [1] >>> >>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple >>> tasks rely on it, then waker's higher latency will damage all of them, pull >>> wakee seems to be a bad deal." >>> >>> Again I didn't follow why the second condition couldn't just be: >>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + >>> wakee->nr_wakee_switch) > factor, based on the above explanation from >>> Micheal Wang that I quoted. >>> and why he's instead doing the whole multiplication thing there that I >>> was talking about earlier: "factor * wakee->nr_wakee_switch". >>> >>> Rephrasing my question in another way, why are we talking the ratio of >>> master/slave instead of the sum when comparing if its > factor? I am >>> surely missing something here. >>> >>> Just taking an example: >>> >>> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8 >>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common >>> between all 3 masters. Also to make it a bit more interesting, let s8 >>> wake up some random task T0. 
A diagram to show the master/slave >>> relation ships could look like: >>> >>> +-----+ >>> | | >>> +------------------------+ +------+ M2 | >>> | | | | | >>> | M1 | | +--+--+---- >>> | | | | | | | >>> | | | | | | s15 >>> +--+--+--+--+--+--+---+--+---+ v v v >>> | | | | | | | | s9 s10 s11 >>> v v v v v v v v >>> s1 s2 s3 s4 s5 s6 s7 s8 ---> T0 >>> ^ >>> | >>> +-+---+ >>> | | >>> | M3 | >>> | | >>> +--+--+----- >>> | | | | >>> v v v v >>> s12 s13 s14 s16 >>> >>> >>> Lets consider the case of M1 waking up s8. As per the above diagram, >>> M1 has 8 flips and s8 has 4 flips. >>> >>> With llc_size = 3, the condition >>> >>> (slave < factor) would return FALSE, so then we would turn to the >>> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3), >>> so wake_wide would return 0 and would cause s8 to be woken up as >>> affine with relation to M1's core. >>> >>> So basically, it seems the heuristic is saying (with help of the >>> second condition - master < slave * factor). that Its a good idea for >>> s8 to be affine-woken-up with respect to M1's core. Why is it a good >>> idea to do that? It seems to me M1 has already several tasks its >>> waking as affine so causing s8 to be woken up affine could be harmful >>> and it may be a better choice to wake it up elsewhere. >>> >>> Thanks for your help! >>> >>> -Joel >>> >>> [1] https://lkml.org/lkml/2013/7/4/20 >>> >>> >>>> >>>> Something else to consider: network interrupt waking multiple workers >>>> at high frequency. If the waking CPU is idle, do you really want to >>>> place a worker directly in front of a tattoo artist, or is it better >>>> off nearly anywhere but there? >>>> >>>> If the box is virtual, with no topology exposed (or real but ancient) >>>> to let select_idle_sibling() come to the rescue, two workers can even >>>> get tattooed simultaneously (see sync wakeup). >>>> >>>> -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
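It is worth keeping in mind what the compared "flips" actually are: a flip is counted only when a task wakes a different partner than it woke last time, and the kernel's record_wakee() halves the count roughly once a second, so wakee_flips behaves like a partner-switch frequency rather than a lifetime total. A hedged userspace model (plain integer timestamps stand in for jiffies, and the helper scenarios are illustrative, not kernel code):

```c
#include <stdio.h>

struct task {
	unsigned int wakee_flips;
	int last_wakee;          /* id of the last task we woke */
	unsigned long decay_ts;  /* coarse timestamp, "seconds" */
};

/* Model of record_wakee(): decay the count once per time step,
 * and count a flip only when the wakee changes. */
static void record_wakee(struct task *waker, int wakee_id,
                         unsigned long now)
{
	if (now > waker->decay_ts) {
		waker->wakee_flips >>= 1;
		waker->decay_ts = now;
	}
	if (waker->last_wakee != wakee_id) {
		waker->last_wakee = wakee_id;
		waker->wakee_flips++;
	}
}

/* A master round-robining over nr_slaves distinct wakees. */
static unsigned int flips_round_robin(int nr_slaves)
{
	struct task m = { 0, -1, 0 };
	for (int i = 0; i < nr_slaves; i++)
		record_wakee(&m, i, 0);
	return m.wakee_flips;
}

/* A slave that only ever wakes its one master. */
static unsigned int flips_single_partner(int wakeups)
{
	struct task s = { 0, -1, 0 };
	for (int i = 0; i < wakeups; i++)
		record_wakee(&s, 99, 0);
	return s.wakee_flips;
}
```

This is why the "slave < factor" branch fires for dedicated 1:1 partners: a slave that wakes only its master accumulates a single flip no matter how many wakeups it performs, while M1 in the diagram gains one flip per distinct slave it cycles through.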
* Re: wake_wide mechanism clarification 2017-07-29 8:01 ` Joel Fernandes 2017-07-29 8:13 ` Joel Fernandes @ 2017-07-29 15:07 ` Mike Galbraith 2017-07-29 20:19 ` Joel Fernandes 1 sibling, 1 reply; 29+ messages in thread From: Mike Galbraith @ 2017-07-29 15:07 UTC (permalink / raw) To: Joel Fernandes Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Sat, 2017-07-29 at 01:01 -0700, Joel Fernandes wrote: > Hi Mike, > > I have take spent some time understanding the email thread and > previous discussions. Unfortunately the second condition we are > checking for in the wake_wide still didn't make sense to me (mentioned > below) :-( > > On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith > <umgwanakikbuti@gmail.com> wrote: > > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: > >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: > >> > >> > That makes sense that we multiply slave's flips by a factor because > >> > its low, but I still didn't get why the factor is chosen to be > >> > llc_size instead of something else for the multiplication with slave > >> > (slave * factor). > > > >> Yeah I don't know why llc_size was chosen... > > > > static void update_top_cache_domain(int cpu) > > { > > struct sched_domain_shared *sds = NULL; > > struct sched_domain *sd; > > int id = cpu; > > int size = 1; > > > > sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); > > if (sd) { > > id = cpumask_first(sched_domain_span(sd)); > > size = cpumask_weight(sched_domain_span(sd)); > > sds = sd->shared; > > } > > > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd); > > per_cpu(sd_llc_size, cpu) = size; > > > > The goal of wake wide was to approximate when pulling would be a futile > > consolidation effort and counterproductive to scaling. 
'course with > > ever increasing socket size, any 1:N waker is ever more likely to run > > out of CPU for its one and only self (slamming into scaling wall) > > before it needing to turn its minions loose to conquer the world. > > Actually the original question was why do we have the second condition > as "master < slave * factor", instead of "master < factor". that's > what didn't make sense to me. Why don't we return 0 from wake_wide if > master < factor ? > > Infact, as the factor is set to the llc_size, I think the condition > that makes sense to me is: > > if ((master + slave) < llc_size) > return 0; That says to me turn affine wakeups off for nearly everything. > In other words, if the master flips and the slave flips are totally > higher than the llc_size, then we are most likely waking up too many > tasks as affine and should then switch to wide to prevent overloading. The heuristic is mostly about the ratio of flip counts. > Digging further into the original patch from Michael Wang (I also CC'd > him), this was the code (before you had changed it to master/slave): > > wakee->nr_wakee_switch > factor && > waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch) Yeah, it was originally unidirectional. > To explain the second condition above, Michael Wang said the following in [1] > > "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple > tasks rely on it, then waker's higher latency will damage all of them, pull > wakee seems to be a bad deal." Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N. Is the one flipping partners at least N/s, and the other about N times as often? If so, the two may be part of a too big to wisely pull 1:N. If you have a better idea, by all means, pull it out. Nobody is attached to wake_wide(), in fact, I suspect Peter hates it. I'm not fond of it either, it having obvious holes. The only thing it has going for it is simplicity. Bend it up, replace it, fire away. 
> Again I didn't follow why the second condition couldn't just be: > waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + > wakee->nr_wakee_switch) > factor, based on the above explanation from > Micheal Wang that I quoted. > and why he's instead doing the whole multiplication thing there that I > was talking about earlier: "factor * wakee->nr_wakee_switch". > > Rephrasing my question in another way, why are we talking the ratio of > master/slave instead of the sum when comparing if its > factor? I am > surely missing something here. Because the heuristic tries to not demolish 1:1 buddies. Big partner flip delta means the pair are unlikely to be a communicating pair, perhaps at high frequency where misses hurt like hell. > Just taking an example: > > Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8 > slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common > between all 3 masters. Also to make it a bit more interesting, let s8 > wake up some random task T0. A diagram to show the master/slave > relation ships could look like: I don't need the artwork, as soon as you describe a squid orgie in a bucket with no adult supervision, I can visualize what happens: the load balancer tries to make pedantically perfect load numbers, caring not one whit about any of the relationships in either text or graphic description, stirs bucket vigorously, making squid soup. IMHO, placement optimization of that is a job for a human, the scheduler ain't that smart. (quick like bunny -> smart like bunny) > > +-----+ > | | > +------------------------+ +------+ M2 | > | | | | | > | M1 | | +--+--+---- > | | | | | | | > | | | | | | s15 > +--+--+--+--+--+--+---+--+---+ v v v > | | | | | | | | s9 s10 s11 > v v v v v v v v > s1 s2 s3 s4 s5 s6 s7 s8 ---> T0 > ^ > | > +-+---+ > | | > | M3 | > | | > +--+--+----- > | | | | > v v v v > s12 s13 s14 s16 > > > Lets consider the case of M1 waking up s8. As per the above diagram, > M1 has 8 flips and s8 has 4 flips. 
> > With llc_size = 3, the condition > > (slave < factor) would return FALSE, so then we would turn to the > (master < slave * factor) condition. This would be TRUE (8 < 4 * 3), > so wake_wide would return 0 and would cause s8 to be woken up as > affine with relation to M1's core. > > So basically, it seems the heuristic is saying (with help of the > second condition - master < slave * factor). that Its a good idea for > s8 to be affine-woken-up with respect to M1's core. Why is it a good > idea to do that? It seems to me M1 has already several tasks its > waking as affine so causing s8 to be woken up affine could be harmful > and it may be a better choice to wake it up elsewhere. No matter what you do with s8, you'll be wrong from some POV.. unless of course s8 is a high frequency waker, then migration is a good bet. -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
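Mike's objection to the sum-based alternative can be made concrete: a high-frequency 1:1 pair has large but nearly equal flip counts, so the ratio test keeps it affine while the proposed (master + slave) < llc_size test would force it wide. A sketch with toy numbers (wide_ratio models the actual condition, wide_sum the alternative proposed earlier in the thread):

```c
#include <stdio.h>

/* The ratio-based test actually used: wide only when the flip
 * counts look 1:N. */
static int wide_ratio(unsigned int a, unsigned int b,
                      unsigned int factor)
{
	unsigned int master = a > b ? a : b;
	unsigned int slave  = a > b ? b : a;

	return !(slave < factor || master < slave * factor);
}

/* The sum-based test proposed in the thread: wide whenever the
 * combined flips reach llc_size. */
static int wide_sum(unsigned int a, unsigned int b,
                    unsigned int factor)
{
	return (a + b) >= factor;
}
```

With factor 3, a chatty 1:1 pair at (50, 50) stays affine under the ratio test (50 < 50 * 3) but is forced wide by the sum test (100 >= 3). Since almost any active pair quickly exceeds llc_size total flips, the sum test would indeed "turn affine wakeups off for nearly everything", whereas a genuine 1:N pair like (60, 10) goes wide under both.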
* Re: wake_wide mechanism clarification 2017-07-29 15:07 ` Mike Galbraith @ 2017-07-29 20:19 ` Joel Fernandes 2017-07-29 22:28 ` Joel Fernandes 0 siblings, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-07-29 20:19 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming Hi Mike, On Sat, Jul 29, 2017 at 8:07 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote: > On Sat, 2017-07-29 at 01:01 -0700, Joel Fernandes wrote: >> Hi Mike, >> >> I have take spent some time understanding the email thread and >> previous discussions. Unfortunately the second condition we are >> checking for in the wake_wide still didn't make sense to me (mentioned >> below) :-( >> >> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith >> <umgwanakikbuti@gmail.com> wrote: >> > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote: >> >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote: >> >> >> >> > That makes sense that we multiply slave's flips by a factor because >> >> > its low, but I still didn't get why the factor is chosen to be >> >> > llc_size instead of something else for the multiplication with slave >> >> > (slave * factor). >> > >> >> Yeah I don't know why llc_size was chosen... <snip> >> > >> > The goal of wake wide was to approximate when pulling would be a futile >> > consolidation effort and counterproductive to scaling. 'course with >> > ever increasing socket size, any 1:N waker is ever more likely to run >> > out of CPU for its one and only self (slamming into scaling wall) >> > before it needing to turn its minions loose to conquer the world. >> >> Actually the original question was why do we have the second condition >> as "master < slave * factor", instead of "master < factor". that's >> what didn't make sense to me. Why don't we return 0 from wake_wide if >> master < factor ? 
>> >> Infact, as the factor is set to the llc_size, I think the condition >> that makes sense to me is: >> >> if ((master + slave) < llc_size) >> return 0; > > That says to me turn affine wakeups off for nearly everything. Ok, I see that now. thanks >> To explain the second condition above, Michael Wang said the following in [1] >> >> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple >> tasks rely on it, then waker's higher latency will damage all of them, pull >> wakee seems to be a bad deal." > > Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N. Is > the one flipping partners at least N/s, and the other about N times as > often? If so, the two may be part of a too big to wisely pull 1:N. > > If you have a better idea, by all means, pull it out. Nobody is Sure yeah, first I'm trying to understand the heuristic itself which I'm glad to be making progress with thanks to yours and others' help! > attached to wake_wide(), in fact, I suspect Peter hates it. I'm not > fond of it either, it having obvious holes. The only thing it has > going for it is simplicity. Bend it up, replace it, fire away. > Ok, it makes much more sense to me now. Also for the N:N case, wouldn't the excessive wake-affine increase the latency and a spreading might be better? Say if slave and master flips are much greater than factor (llc_size), then slave > factor && master < slave * factor, would probably return true a lot (and we would return 0 causing an affine wakeup). That's probably a bad thing right as it could overload the waker's CPU quickly? I guess the heuristic tries to maximize cache-hits more than reduce latency? >> Again I didn't follow why the second condition couldn't just be: >> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + >> wakee->nr_wakee_switch) > factor, based on the above explanation from >> Micheal Wang that I quoted. 
>> and why he's instead doing the whole multiplication thing there that I
>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>
>> Rephrasing my question in another way, why are we talking the ratio of
>> master/slave instead of the sum when comparing if its > factor? I am
>> surely missing something here.
>
> Because the heuristic tries to not demolish 1:1 buddies.  Big partner
> flip delta means the pair are unlikely to be a communicating pair,
> perhaps at high frequency where misses hurt like hell.

But it does seem to me to demolish the N:N communicating pairs from a latency/load balancing standpoint. For the case of N readers and N writers, the ratio (master/slave) comes down to 1:1 and we wake affine. Hopefully I didn't miss something too obvious about that.

thanks,

-Joel

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-29 20:19 ` Joel Fernandes @ 2017-07-29 22:28 ` Joel Fernandes 2017-07-29 22:41 ` Joel Fernandes 0 siblings, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-07-29 22:28 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming Hi Mike, On Sat, Jul 29, 2017 at 1:19 PM, Joel Fernandes <joelaf@google.com> wrote: <snip> > >>> To explain the second condition above, Michael Wang said the following in [1] >>> >>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple >>> tasks rely on it, then waker's higher latency will damage all of them, pull >>> wakee seems to be a bad deal." >> >> Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N. Is >> the one flipping partners at least N/s, and the other about N times as >> often? If so, the two may be part of a too big to wisely pull 1:N. >> >> If you have a better idea, by all means, pull it out. Nobody is > > Sure yeah, first I'm trying to understand the heuristic itself which > I'm glad to be making progress with thanks to yours and others' help! > >> attached to wake_wide(), in fact, I suspect Peter hates it. I'm not >> fond of it either, it having obvious holes. The only thing it has >> going for it is simplicity. Bend it up, replace it, fire away. >> > > Ok, it makes much more sense to me now. Also for the N:N case, > wouldn't the excessive wake-affine increase the latency and a > spreading might be better? Say if slave and master flips are much > greater than factor (llc_size), then slave > factor && master < slave > * factor, would probably return true a lot (and we would return 0 > causing an affine wakeup). That's probably a bad thing right as it > could overload the waker's CPU quickly? I guess the heuristic tries to > maximize cache-hits more than reduce latency? 
>
>>> Again I didn't follow why the second condition couldn't just be:
>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>> Michael Wang that I quoted.
>>> and why he's instead doing the whole multiplication thing there that I
>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>>
>>> Rephrasing my question in another way, why are we talking the ratio of
>>> master/slave instead of the sum when comparing if it's > factor? I am
>>> surely missing something here.
>>
>> Because the heuristic tries to not demolish 1:1 buddies. Big partner
>> flip delta means the pair are unlikely to be a communicating pair,
>> perhaps at high frequency where misses hurt like hell.
>
> But it does seem to me to demolish the N:N communicating pairs from a
> latency/load balancing standpoint. For the case of N readers and N
> writers, the ratio (master/slave) comes down to 1:1 and we wake
> affine. Hopefully I didn't miss something too obvious about that.

I think wake_affine() should correctly handle the case (of
overloading) I bring up here where wake_wide() is too conservative and
does affine a lot (I don't have any data for this though, this is just
from code reading), so I take this comment back for this reason.

thanks,

-Joel

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-29 22:28 ` Joel Fernandes @ 2017-07-29 22:41 ` Joel Fernandes 2017-07-31 12:21 ` Josef Bacik 0 siblings, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-07-29 22:41 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <joelaf@google.com> wrote: <snip> >>>> Again I didn't follow why the second condition couldn't just be: >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from >>>> Micheal Wang that I quoted. >>>> and why he's instead doing the whole multiplication thing there that I >>>> was talking about earlier: "factor * wakee->nr_wakee_switch". >>>> >>>> Rephrasing my question in another way, why are we talking the ratio of >>>> master/slave instead of the sum when comparing if its > factor? I am >>>> surely missing something here. >>> >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner >>> flip delta means the pair are unlikely to be a communicating pair, >>> perhaps at high frequency where misses hurt like hell. >> >> But it does seem to me to demolish the N:N communicating pairs from a >> latency/load balancing standpoint. For he case of N readers and N >> writers, the ratio (master/slave) comes down to 1:1 and we wake >> affine. Hopefully I didn't miss something too obvious about that. > > I think wake_affine() should correctly handle the case (of > overloading) I bring up here where wake_wide() is too conservative and > does affine a lot, (I don't have any data for this though, this just > from code reading), so I take this comment back for this reason. 
aargh, nope :( it still runs select_idle_sibling although on the
previous CPU even if want_affine is 0 (and doesn't do the wider
wakeup..), so the comment still applies.. it's easy to get lost in
the code with so many if statements :-\ sorry about the noise :)

thanks,

-Joel

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-29 22:41 ` Joel Fernandes @ 2017-07-31 12:21 ` Josef Bacik 2017-07-31 13:42 ` Mike Galbraith 2017-07-31 16:21 ` Joel Fernandes 0 siblings, 2 replies; 29+ messages in thread From: Josef Bacik @ 2017-07-31 12:21 UTC (permalink / raw) To: Joel Fernandes Cc: Mike Galbraith, Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Sat, Jul 29, 2017 at 03:41:56PM -0700, Joel Fernandes wrote: > On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <joelaf@google.com> wrote: > <snip> > >>>> Again I didn't follow why the second condition couldn't just be: > >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + > >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from > >>>> Micheal Wang that I quoted. > >>>> and why he's instead doing the whole multiplication thing there that I > >>>> was talking about earlier: "factor * wakee->nr_wakee_switch". > >>>> > >>>> Rephrasing my question in another way, why are we talking the ratio of > >>>> master/slave instead of the sum when comparing if its > factor? I am > >>>> surely missing something here. > >>> > >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner > >>> flip delta means the pair are unlikely to be a communicating pair, > >>> perhaps at high frequency where misses hurt like hell. > >> > >> But it does seem to me to demolish the N:N communicating pairs from a > >> latency/load balancing standpoint. For he case of N readers and N > >> writers, the ratio (master/slave) comes down to 1:1 and we wake > >> affine. Hopefully I didn't miss something too obvious about that. > > > > I think wake_affine() should correctly handle the case (of > > overloading) I bring up here where wake_wide() is too conservative and > > does affine a lot, (I don't have any data for this though, this just > > from code reading), so I take this comment back for this reason. 
>
> aargh, nope :( it still runs select_idle_sibling although on the
> previous CPU even if want_affine is 0 (and doesn't do the wider
> wakeup..), so the comment still applies.. it's easy to get lost in
> the code with so many if statements :-\ sorry about the noise :)
>

I've been working in this area recently because of a cpu imbalance problem.
Wake_wide() definitely makes it so we're waking affine way too often, but I
think messing with wake_wide() to solve that problem is the wrong solution.
This is just a heuristic to see if we should wake affine, the simpler the
better. I solved the problem of waking affine too often like this

https://marc.info/?l=linux-kernel&m=150003849602535&w=2

So why do you care about wake_wide() anyway? Are you observing some problem
that you suspect is affected by the affine wakeup stuff? Or are you just trying
to understand what is going on for fun? Cause if you are just doing this for
fun you are a very strange person, thanks,

Josef

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-31 12:21 ` Josef Bacik @ 2017-07-31 13:42 ` Mike Galbraith 2017-07-31 14:48 ` Josef Bacik 2017-07-31 16:21 ` Joel Fernandes 1 sibling, 1 reply; 29+ messages in thread From: Mike Galbraith @ 2017-07-31 13:42 UTC (permalink / raw) To: Josef Bacik, Joel Fernandes Cc: Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Mon, 2017-07-31 at 12:21 +0000, Josef Bacik wrote: > > I've been working in this area recently because of a cpu imbalance problem. > Wake_wide() definitely makes it so we're waking affine way too often, but I > think messing with wake_waide to solve that problem is the wrong solution. This > is just a heuristic to see if we should wake affine, the simpler the better. I > solved the problem of waking affine too often like this > > https://marc.info/?l=linux-kernel&m=150003849602535&w=2 Wait a minute, that's not quite fair :) Wake_wide() can't be blamed for causing too frequent affine wakeups when what it does is filter some out. While it may not reject aggressively enough for you (why you bent it up to be very aggressive), seems the problem from your loads POV is the scheduler generally being too eager to bounce. I've also played with rate limiting migration per task, but it had negative effects too: when idle/periodic balance pulls buddies apart, rate limiting inhibits them quickly finding each other again, making undoing all that hard load balancer work a throughput win. Sigh. -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification
  2017-07-31 13:42 ` Mike Galbraith
@ 2017-07-31 14:48 ` Josef Bacik
  2017-07-31 17:23 ` Mike Galbraith
  0 siblings, 1 reply; 29+ messages in thread
From: Josef Bacik @ 2017-07-31 14:48 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli,
    Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath,
    Michael Wang, Matt Fleming

On Mon, Jul 31, 2017 at 03:42:25PM +0200, Mike Galbraith wrote:
> On Mon, 2017-07-31 at 12:21 +0000, Josef Bacik wrote:
> >
> > I've been working in this area recently because of a cpu imbalance problem.
> > Wake_wide() definitely makes it so we're waking affine way too often, but I
> > think messing with wake_wide() to solve that problem is the wrong solution.
> > This is just a heuristic to see if we should wake affine, the simpler the
> > better. I solved the problem of waking affine too often like this
> >
> > https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>
> Wait a minute, that's not quite fair :) Wake_wide() can't be blamed
> for causing too frequent affine wakeups when what it does is filter
> some out. While it may not reject aggressively enough for you (why you
> bent it up to be very aggressive), seems the problem from your loads
> POV is the scheduler generally being too eager to bounce.
>

Yeah sorry, I hate this stuff because it's so hard to talk about without mixing
up different ideas. I should say the scheduler in general prefers to wake
affine super hard, and wake_wide() is conservative in its filtering of this
behavior. The rest still holds true: I think tinkering with it is just hard
and the wrong place to do it; it's a good first step, and we can be smarter
further down.
> I've also played with rate limiting migration per task, but it had > negative effects too: when idle/periodic balance pulls buddies apart, > rate limiting inhibits them quickly finding each other again, making > undoing all that hard load balancer work a throughput win. Sigh. > That's why I did the HZ thing, we don't touch the task for HZ to let things settle out, and then allow affine wakeups after that. Now HZ may be an eternity in scheduler time, but I think its a good middle ground. For our case the box is loaded constantly, so we basically never want affine wakeups for our app. For the case where there's spikey behavior we'll return to normal affine wakeups a short while later. But from my admittedly limited testing it appears to be a win overall. Thanks, Josef ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-31 14:48 ` Josef Bacik @ 2017-07-31 17:23 ` Mike Galbraith 0 siblings, 0 replies; 29+ messages in thread From: Mike Galbraith @ 2017-07-31 17:23 UTC (permalink / raw) To: Josef Bacik Cc: Joel Fernandes, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Mon, 2017-07-31 at 14:48 +0000, Josef Bacik wrote: > On Mon, Jul 31, 2017 at 03:42:25PM +0200, Mike Galbraith wrote: > > On Mon, 2017-07-31 at 12:21 +0000, Josef Bacik wrote: > > > > > > I've been working in this area recently because of a cpu imbalance problem. > > > Wake_wide() definitely makes it so we're waking affine way too often, but I > > > think messing with wake_waide to solve that problem is the wrong solution. This > > > is just a heuristic to see if we should wake affine, the simpler the better. I > > > solved the problem of waking affine too often like this > > > > > > https://marc.info/?l=linux-kernel&m=150003849602535&w=2 > > > > Wait a minute, that's not quite fair :) Wake_wide() can't be blamed > > for causing too frequent affine wakeups when what it does is filter > > some out. While it may not reject aggressively enough for you (why you > > bent it up to be very aggressive), seems the problem from your loads > > POV is the scheduler generally being too eager to bounce. > > > > Yeah sorry, I hate this stuff because it's so hard to talk about without mixing > up different ideas. I should say the scheduler in general prefers to wake > affine super hard, and wake_wide() is conservative in it's filtering of this > behavior. The rest still holds true, I think tinkering with it is just hard and > the wrong place to do it, it's a good first step, and we can be smarter further > down. Yeah, it's hard, and yeah, bottom line remains unchanged. 
> > I've also played with rate limiting migration per task, but it had
> > negative effects too: when idle/periodic balance pulls buddies apart,
> > rate limiting inhibits them quickly finding each other again, making
> > undoing all that hard load balancer work a throughput win. Sigh.
>
> That's why I did the HZ thing, we don't touch the task for HZ to let things
> settle out, and then allow affine wakeups after that.

I kinda like the way you did it better than what I tried, but until a
means exists to _target_ the win, it's gonna be rob Peter to pay Paul,
swap roles, repeat endlessly.

	-Mike

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-31 12:21 ` Josef Bacik 2017-07-31 13:42 ` Mike Galbraith @ 2017-07-31 16:21 ` Joel Fernandes 2017-07-31 16:42 ` Josef Bacik 1 sibling, 1 reply; 29+ messages in thread From: Joel Fernandes @ 2017-07-31 16:21 UTC (permalink / raw) To: Josef Bacik Cc: Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming Hi Josef, On Mon, Jul 31, 2017 at 5:21 AM, Josef Bacik <josef@toxicpanda.com> wrote: > On Sat, Jul 29, 2017 at 03:41:56PM -0700, Joel Fernandes wrote: >> On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <joelaf@google.com> wrote: >> <snip> >> >>>> Again I didn't follow why the second condition couldn't just be: >> >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + >> >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from >> >>>> Micheal Wang that I quoted. >> >>>> and why he's instead doing the whole multiplication thing there that I >> >>>> was talking about earlier: "factor * wakee->nr_wakee_switch". >> >>>> >> >>>> Rephrasing my question in another way, why are we talking the ratio of >> >>>> master/slave instead of the sum when comparing if its > factor? I am >> >>>> surely missing something here. >> >>> >> >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner >> >>> flip delta means the pair are unlikely to be a communicating pair, >> >>> perhaps at high frequency where misses hurt like hell. >> >> >> >> But it does seem to me to demolish the N:N communicating pairs from a >> >> latency/load balancing standpoint. For he case of N readers and N >> >> writers, the ratio (master/slave) comes down to 1:1 and we wake >> >> affine. Hopefully I didn't miss something too obvious about that. 
> >> >
> >> > I think wake_affine() should correctly handle the case (of
> >> > overloading) I bring up here where wake_wide() is too conservative and
> >> > does affine a lot (I don't have any data for this though, this is just
> >> > from code reading), so I take this comment back for this reason.
> >>
> >> aargh, nope :( it still runs select_idle_sibling although on the
> >> previous CPU even if want_affine is 0 (and doesn't do the wider
> >> wakeup..), so the comment still applies.. it's easy to get lost in
> >> the code with so many if statements :-\ sorry about the noise :)
> >>
>
> I've been working in this area recently because of a cpu imbalance problem.
> Wake_wide() definitely makes it so we're waking affine way too often, but I
> think messing with wake_wide() to solve that problem is the wrong solution.
> This is just a heuristic to see if we should wake affine, the simpler the
> better. I solved the problem of waking affine too often like this
>
> https://marc.info/?l=linux-kernel&m=150003849602535&w=2

Thanks! Cool!

>
> So why do you care about wake_wide() anyway? Are you observing some problem
> that you suspect is affected by the affine wakeup stuff? Or are you just trying

I am dealing with an affine wake up issue, yes.

> to understand what is going on for fun? Cause if you are just doing this for
> fun you are a very strange person, thanks,

It's not just for fun :) Let me give you some background about me, I
work in the Android team and one of the things I want to do is to take
an out of tree patch that's been carried for some time and post a more
upstreamable solution - this is related to wake ups from the binder
driver which does sync wake ups (WF_SYNC). I can't find the exact out
of tree patch publicly since it wasn't posted to a list, but the code
is here [1]. What's worse is I have recently found really bad issues
with this patch itself where runnable times are increased. I should
have provided this background earlier (sorry that I didn't; my plan
was to trigger a separate discussion about the binder sync wake up
thing as a part of a patch/proposal I want to post - which I plan to
do). Anyway, as a part of this effort, I want to understand
wake_wide() better and "respect" it since it sits in the wake up path,
and I wanted my proposal to work well with it, especially since I
want to solve this problem in an upstream-friendly way.

The other reason to trigger the discussion is that I have seen
wake_wide() enough times and asked enough folks how it works that it
seemed sensible to ask about it here (I was also advised to ask about
wake_wide on LKML since few people seemingly understand how it works),
and hopefully now it's a bit better understood.

I agree with you that instead of spending insane amounts of time on
wake_wide itself, it's better to directly work on a problem and collect
some data - which is also what I'm doing, but I still thought it was
worth doing some digging into wake_wide() during some free time I had,
thanks.

Cheers,
-Joel

[1] https://android.googlesource.com/kernel/msm.git/+/377e6e28b6097b3d6de7245d3d3def45fc8c9ffc/kernel/sched/fair.c#5492

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-31 16:21 ` Joel Fernandes @ 2017-07-31 16:42 ` Josef Bacik 2017-07-31 17:55 ` Joel Fernandes 0 siblings, 1 reply; 29+ messages in thread From: Josef Bacik @ 2017-07-31 16:42 UTC (permalink / raw) To: Joel Fernandes Cc: Josef Bacik, Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Mon, Jul 31, 2017 at 09:21:46AM -0700, Joel Fernandes wrote: > Hi Josef, > > On Mon, Jul 31, 2017 at 5:21 AM, Josef Bacik <josef@toxicpanda.com> wrote: > > On Sat, Jul 29, 2017 at 03:41:56PM -0700, Joel Fernandes wrote: > >> On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <joelaf@google.com> wrote: > >> <snip> > >> >>>> Again I didn't follow why the second condition couldn't just be: > >> >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch + > >> >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from > >> >>>> Micheal Wang that I quoted. > >> >>>> and why he's instead doing the whole multiplication thing there that I > >> >>>> was talking about earlier: "factor * wakee->nr_wakee_switch". > >> >>>> > >> >>>> Rephrasing my question in another way, why are we talking the ratio of > >> >>>> master/slave instead of the sum when comparing if its > factor? I am > >> >>>> surely missing something here. > >> >>> > >> >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner > >> >>> flip delta means the pair are unlikely to be a communicating pair, > >> >>> perhaps at high frequency where misses hurt like hell. > >> >> > >> >> But it does seem to me to demolish the N:N communicating pairs from a > >> >> latency/load balancing standpoint. For he case of N readers and N > >> >> writers, the ratio (master/slave) comes down to 1:1 and we wake > >> >> affine. Hopefully I didn't miss something too obvious about that. 
> >> > > >> > I think wake_affine() should correctly handle the case (of > >> > overloading) I bring up here where wake_wide() is too conservative and > >> > does affine a lot, (I don't have any data for this though, this just > >> > from code reading), so I take this comment back for this reason. > >> > >> aargh, nope :( it still runs select_idle_sibling although on the > >> previous CPU even if want_affine is 0 (and doesn't do the wider > >> wakeup..), so the comment still applies.. its easy to get lost into > >> the code with so many if statements :-\ sorry about the noise :) > >> > > > > I've been working in this area recently because of a cpu imbalance problem. > > Wake_wide() definitely makes it so we're waking affine way too often, but I > > think messing with wake_waide to solve that problem is the wrong solution. This > > is just a heuristic to see if we should wake affine, the simpler the better. I > > solved the problem of waking affine too often like this > > > > https://marc.info/?l=linux-kernel&m=150003849602535&w=2 > > Thanks! Cool! > > > > > So why do you care about wake_wide() anyway? Are you observing some problem > > that you suspect is affected by the affine wakeup stuff? Or are you just trying > > I am dealing with an affine wake up issue, yes. > > > to understand what is going on for fun? Cause if you are just doing this for > > fun you are a very strange person, thanks, > > Its not just for fun :) Let me give you some background about me, I > work in the Android team and one of the things I want to do is to take > an out of tree patch that's been carried for some time and post a more > upstreamable solution - this is related to wake ups from the binder > driver which does sync wake ups (WF_SYNC). I can't find the exact out > of tree patch publicly since it wasn't posted to a list, but the code > is here [1]. What's worse is I have recently found really bad issues > with this patch itself where runnable times are increased. 
I should > have provided this background earlier (sorry that I didn't, my plan > was to trigger a separate discussion about the binder sync wake up > thing as a part of a patch/proposal I want to post - which I plan to > do so). Anyway, as a part of this effort, I want to understand > wake_wide() better and "respect" it since it sits in the wake up path > and I wanted to my proposal to work well with it, especially since I > want to solve this problem in an upstream-friendly way. > > The other reason to trigger the discussion, is, I have seen > wake_wide() enough number of times and asked enough number of folks > how it works that it seems sensible to ask about it here (I was also > suggested to ask about wake_wide on LKML because since few people > seemingly understand how it works) and hopefully now its a bit better > understood. > > I agree with you that instead of spending insane amounts of time on > wake_wide itself, its better to directly work on a problem and collect > some data - which is also what I'm doing, but I still thought its > worth doing some digging into wake_wide() during some free time I had, > thanks. > Ok so your usecase is to _always_ wake affine if we're doing a sync wakeup. I _think_ for your case it's best to make wake_affine() make this decision, and you don't want wake_wide() to filter out your wakeup as not-affine? So perhaps just throw a test in there to not wake wide if WF_SYNC is set. This makes logical sense to me as synchronous wakeups are probably going to want to be affine wakeups, and then we can rely on wake_affine() to do the load checks to make sure it really makes sense. How does that sound? Thanks, Josef ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-07-31 16:42 ` Josef Bacik @ 2017-07-31 17:55 ` Joel Fernandes 0 siblings, 0 replies; 29+ messages in thread From: Joel Fernandes @ 2017-07-31 17:55 UTC (permalink / raw) To: Josef Bacik Cc: Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang, Matt Fleming On Mon, Jul 31, 2017 at 9:42 AM, Josef Bacik <josef@toxicpanda.com> wrote: <snip> >> >> > >> > So why do you care about wake_wide() anyway? Are you observing some problem >> > that you suspect is affected by the affine wakeup stuff? Or are you just trying >> >> I am dealing with an affine wake up issue, yes. >> >> > to understand what is going on for fun? Cause if you are just doing this for >> > fun you are a very strange person, thanks, >> >> Its not just for fun :) Let me give you some background about me, I >> work in the Android team and one of the things I want to do is to take >> an out of tree patch that's been carried for some time and post a more >> upstreamable solution - this is related to wake ups from the binder >> driver which does sync wake ups (WF_SYNC). I can't find the exact out >> of tree patch publicly since it wasn't posted to a list, but the code >> is here [1]. What's worse is I have recently found really bad issues >> with this patch itself where runnable times are increased. I should >> have provided this background earlier (sorry that I didn't, my plan >> was to trigger a separate discussion about the binder sync wake up >> thing as a part of a patch/proposal I want to post - which I plan to >> do so). Anyway, as a part of this effort, I want to understand >> wake_wide() better and "respect" it since it sits in the wake up path >> and I wanted to my proposal to work well with it, especially since I >> want to solve this problem in an upstream-friendly way. 
>>
>> The other reason to trigger the discussion is that I have seen
>> wake_wide() enough times and asked enough folks how it works that it
>> seemed sensible to ask about it here (I was also advised to ask about
>> wake_wide on LKML since few people seemingly understand how it works),
>> and hopefully now it's a bit better understood.
>>
>> I agree with you that instead of spending insane amounts of time on
>> wake_wide itself, it's better to directly work on a problem and collect
>> some data - which is also what I'm doing, but I still thought it was
>> worth doing some digging into wake_wide() during some free time I had,
>> thanks.
>>
>
> Ok so your usecase is to _always_ wake affine if we're doing a sync wakeup. I
> _think_ for your case it's best to make wake_affine() make this decision, and
> you don't want wake_wide() to filter out your wakeup as not-affine? So perhaps
> just throw a test in there to not wake wide if WF_SYNC is set. This makes

Hmm, I was actually thinking that since 'sync' is more of a hint, we do
a wake_wide() first anyway, since it's already so conservative, and for
the times it does resort to wide, it's probably the right decision from
a scheduling standpoint to avoid affine and avoid piling up too many
tasks too quickly. Do you think that's fair? I tried a quick patch, and
doing wake_wide first and then checking for sync does seem to work well.

> logical sense to me as synchronous wakeups are probably going to want to be
> affine wakeups, and then we can rely on wake_affine() to do the load checks to
> make sure it really makes sense. How does that sound? Thanks,

Yep that sounds good and I will try that.
What I was thinking was to do the regular wake_wide and wake_affine checks,
and then, in select_task_rq_fair, before calling select_idle_sibling, check
if only 1 task is running on the waker's CPU (this is after doing the
wake_wide and wake_affine checks):

+	idle_sync = (sync && (new_cpu == cpu) &&
+		     cpu_rq(cpu)->nr_running == 1);

and then in select_idle_sibling, something like:

+	/*
+	 * If the previous and target CPU share cache and a sync wakeup
+	 * is requested and the CPU is about to go idle, because it
+	 * has only the waker running which requested sync, the target
+	 * is a better choice for cache affinity and keeping the task's
+	 * previous core idle and in a low power state.
+	 */
+	if (idle_sync && cpus_share_cache(prev, target))
+		return target;
+

I haven't tested this patch though so I'm not sure it works well yet, but
just sharing the idea. What do you think?

thanks,

-Joel

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification 2017-06-30 0:49 ` Josef Bacik 2017-06-30 3:04 ` Joel Fernandes @ 2017-06-30 3:11 ` Mike Galbraith 2017-06-30 13:11 ` Matt Fleming 2 siblings, 0 replies; 29+ messages in thread From: Mike Galbraith @ 2017-06-30 3:11 UTC (permalink / raw) To: Josef Bacik, Joel Fernandes Cc: Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath On Thu, 2017-06-29 at 20:49 -0400, Josef Bacik wrote: > On Thu, Jun 29, 2017 at 05:19:14PM -0700, Joel Fernandes wrote: > > > Why are wanting the master's flip frequency to be higher than the > > slaves by the factor? > > (Responding from my personal email as my work email is outlook shit and > impossible to use) > > Because we are trying to detect the case that the master is waking many > different processes, and the 'slave' processes are only waking up the > master/some other specific processes to determine if we don't care about cache > locality. Yes, the heuristic (large delta implies waker/wakee are NOT 1:1, ie filter out high frequency communication where eating misses doesn't merely sting, it hurts like hell) just became bidirectional. > Actually I think both are wrong, but I need Mike to weigh in. My weigh in is this: if you have ideas to improve or replace that heuristic, by all means go for it, just make damn sure it's dirt cheap. Heuristics all suck one way or another, problem is that nasty old "perfect is the enemy of good" adage. Make it perfect, it'll hurt. -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: wake_wide mechanism clarification
  2017-06-30  0:49 ` Josef Bacik
  2017-06-30  3:04 ` Joel Fernandes
  2017-06-30  3:11 ` Mike Galbraith
@ 2017-06-30 13:11 ` Matt Fleming
  2 siblings, 0 replies; 29+ messages in thread
From: Matt Fleming @ 2017-06-30 13:11 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Joel Fernandes, Mike Galbraith, Peter Zijlstra, LKML, Juri Lelli,
	Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath,
	Vincent Guittot

On Thu, 29 Jun, at 08:49:13PM, Josef Bacik wrote:
>
> It may be worth to try with schedbench and trace it to see how this
> turns out in practice, as that's the workload that generated all this
> discussion before.  I imagine generally speaking this works out
> properly.  The small regression I reported before was at low RPS, so
> we wouldn't be waking up as many tasks as often, so we would be
> returning 0 from wake_wide() and we'd get screwed.  This is where I
> think possibly dropping the slave < factor part of the test would
> address that, but I'd have to trace it to say for sure.  Thanks,

Just 2 weeks ago I was poking at wake_wide() because it's impacting
hackbench times now that we're better at balancing on fork() (see
commit 6b94780e45c1 ("sched/core: Use load_avg for selecting idlest
group")).

What's happening is that occasionally the hackbench times will be
pretty large because the hackbench tasks are being pulled back and
forth across NUMA domains due to the wake_wide() logic.

Reproducing this issue does require a NUMA box with more CPUs than
hackbench tasks. I was using an 80-cpu 2 NUMA node box with 1
hackbench group (20 readers, 20 writers).
I did the following very quick hack,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1f5efa51dc7..c1bc1b0434bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5055,7 +5055,7 @@ static int wake_wide(struct task_struct *p)
 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if (master < slave * factor)
 		return 0;
 	return 1;
 }

Which produces the following results for the 1 group (40 tasks) on one
of SUSE's enterprise kernels:

hackbench-process-pipes
                              4.4.71                 4.4.71
                             patched  patched+wake-wide-fix
Min      1      0.7000 (  0.00%)        0.8480 (-21.14%)
Amean    1      1.0343 (  0.00%)        0.9073 ( 12.28%)
Stddev   1      0.2373 (  0.00%)        0.0447 ( 81.15%)
CoeffVar 1     22.9447 (  0.00%)        4.9300 ( 78.51%)
Max      1      1.2270 (  0.00%)        0.9560 ( 22.09%)

You'll see that the minimum value is worse with my change, but the
maximum is much better.

So the current wake_wide() code does help sometimes, but it also hurts
sometimes too.

I'm happy to gather performance data for any code suggestions.

^ permalink raw reply related	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2017-08-10 17:41 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-30  0:19 wake_wide mechanism clarification Joel Fernandes
2017-06-30  0:49 ` Josef Bacik
2017-06-30  3:04   ` Joel Fernandes
2017-06-30 14:28     ` Josef Bacik
2017-06-30 17:02       ` Mike Galbraith
2017-06-30 17:55         ` Josef Bacik
2017-08-03 10:53           ` Brendan Jackman
2017-08-03 13:15             ` Josef Bacik
2017-08-03 15:05               ` Brendan Jackman
2017-08-09 21:22                 ` Atish Patra
2017-08-10  9:48                   ` Brendan Jackman
2017-08-10 17:41                     ` Atish Patra
2017-07-29  8:01           ` Joel Fernandes
2017-07-29  8:13             ` Joel Fernandes
2017-08-02  8:26               ` Michael Wang
2017-08-03 23:48                 ` Joel Fernandes
2017-07-29 15:07             ` Mike Galbraith
2017-07-29 20:19               ` Joel Fernandes
2017-07-29 22:28                 ` Joel Fernandes
2017-07-29 22:41                   ` Joel Fernandes
2017-07-31 12:21                     ` Josef Bacik
2017-07-31 13:42                       ` Mike Galbraith
2017-07-31 14:48                         ` Josef Bacik
2017-07-31 17:23                           ` Mike Galbraith
2017-07-31 16:21                       ` Joel Fernandes
2017-07-31 16:42                         ` Josef Bacik
2017-07-31 17:55                           ` Joel Fernandes
2017-06-30  3:11   ` Mike Galbraith
2017-06-30 13:11   ` Matt Fleming