Date: Wed, 13 Jul 2016 16:54:35 +0100
From: Morten Rasmussen
To: Vincent Guittot
Cc: Peter Zijlstra, "mingo@redhat.com", Dietmar Eggemann, Yuyang Du,
	mgalbraith@suse.de, linux-kernel
Subject: Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
Message-ID: <20160713155434.GA21816@e105550-lin.cambridge.arm.com>

On Wed, Jul 13, 2016 at 02:06:17PM +0200, Vincent Guittot wrote:
> Hi Morten,
>
> On 22 June 2016 at 19:03, Morten Rasmussen wrote:
> > Hi,
> >
> > The scheduler is currently not doing much to help performance on systems with
> > asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> > situation with a few tweaks mainly to the task wake-up path that considers
> > compute capacity at wake-up and not just whether a cpu is idle for these
> > systems. This gives us consistent, and potentially higher, throughput in
> > partially utilized scenarios. SMP behaviour and performance should be
> > unaffected.
> >
> > Test 0:
> >  for i in `seq 1 10`; \
> >     do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
> >     done \
> >  | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> >    END {print "Average events: " sum/runs}'
> >
> > Target: ARM TC2 (2xA15+3xA7)
> >
> > (Higher is better)
> > tip:   Average events: 146.9
> > patch: Average events: 217.9
> >
> > Test 1:
> >  perf stat --null --repeat 10 -- \
> >  perf bench sched messaging -g 50 -l 5000
> >
> > Target: Intel IVB-EP (2*10*2)
> >
> > tip:   4.861970420 seconds time elapsed ( +- 1.39% )
> > patch: 4.886204224 seconds time elapsed ( +- 0.75% )
> >
> > Target: ARM TC2 A7-only (3xA7) (-l 1000)
> >
> > tip:   61.485682596 seconds time elapsed ( +- 0.07% )
> > patch: 62.667950130 seconds time elapsed ( +- 0.36% )
> >
> > More analysis:
> >
> > Statistics from a mixed periodic task workload (rt-app) containing both
> > big and little tasks, single run on ARM TC2:
> >
> > tu   = Task utilization big/little
> > pcpu = Previous cpu big/little
> > tcpu = This (waker) cpu big/little
> > dl   = New cpu is little
> > db   = New cpu is big
> > sis  = New cpu chosen by select_idle_sibling()
> > figc = New cpu chosen by find_idlest_*()
> > ww   = wake_wide(task) count for figc wakeups
> > bw   = sd_flag & SD_BALANCE_WAKE (non-fork/exec wake)
> >        for figc wakeups
> >
> > case  tu  pcpu  tcpu    dl    db   sis  figc    ww    bw
> > 1     l   l     l      122    68    28   162   161   161
> > 2     l   l     b       11     4     0    15    15    15
> > 3     l   b     l        0   252     8   244   244   244
> > 4     l   b     b       36  1928   711  1253  1016  1016
> > 5     b   l     l        5    19     0    24    22    24
> > 6     b   l     b        5     1     0     6     0     6
> > 7     b   b     l        0    31     0    31    31    31
> > 8     b   b     b        1   194   109    86    59    59
> > --------------------------------------------------
> >                        180  2497   856  1821
>
> I'm not sure to know how to interpret all these statistics

Thanks for looking into the details. Let me provide a bit more context.

After our discussion around v1, I wanted to understand how the patches
work with different combinations of task utilization, prev_cpu, and
waking cpu. IIRC, the outcome of our discussion was that tasks with
utilization too high to fit little cpus should go on big cpus, while
tasks small enough to fit anywhere can go anywhere. For the latter we
don't want to spend too much time on placement; they essentially don't
care, so they can be placed using select_idle_sibling().
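To make that policy concrete, the capacity check the series adds is
roughly the sketch below (simplified and from memory, so treat the
exact helpers and margin as illustrative rather than the literal
patch; capacity_margin is a ~25% headroom in 1024-based fixed point):

	/*
	 * Sketch: return true (abandon the affine fast-path) when cpu
	 * capacities differ significantly and the task's utilization
	 * does not fit the smaller of the two candidate cpus with
	 * capacity_margin headroom.
	 */
	static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
	{
		long min_cap, max_cap;

		min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
		max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

		/* Capacities are roughly uniform: wake_affine is fine. */
		if (max_cap - min_cap < max_cap >> 3)
			return 0;

		/* Task does not fit min_cap: take the find_idlest_*() route. */
		return min_cap * 1024 < task_util(p) * capacity_margin;
	}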
So, I created a workload with rt-app containing a number of periodic
tasks with different periods and busy times. I traced all wake-ups and
put them into eight categories depending on the wake-up scenario, i.e.
task utilization, prev_cpu, and waking cpu (tu, pcpu, and tcpu).

The next two columns (dl, db) show the number of wake-ups that ended up
on a little or big cpu. If we take case 1 as an example, we had 190
wake-ups in total (dl + db = 122 + 68) where a little task last ran on
a little cpu and was woken up by a little cpu. In 122 of those wake-ups
a little cpu was selected again, while in 68 cases the task went to a
big cpu. That is fine according to our scheduling policy above.

The sis and figc columns show the split between wake-ups handled by
select_idle_sibling() and those handled by find_idlest_*(). Coming back
to case 1, 28 wake-ups were handled by the former and 162 by the
latter. We can't say exactly how many of the select_idle_sibling()
wake-ups ended up on a big or little cpu, but since 68 > 28, even if
every one of them picked a big cpu, find_idlest_*() must have chosen a
big cpu for a little task in at least 40 cases.

The last two columns, ww and bw, try to explain why so many wake-ups
are handled by find_idlest_*() in the cases (1-4 and 8) where we could
have used select_idle_sibling(). The bw number is the number of
find_idlest_*() wake-ups that were passed the SD_BALANCE_WAKE flag
(i.e. non-FORK and non-EXEC wake-ups). FORK and EXEC wake-ups always
take the find_idlest_*() route, so we should ignore those. For case 1
it turned out that only one of the 162 figc wake-ups was a FORK/EXEC
wake-up, so something else must have caused the rest to not go via
select_idle_sibling(). The ww column explains why: it shows how many of
the figc wake-ups had wake_wide() return true and therefore had
want_affine disabled. Because we have enabled SD_BALANCE_WAKE on the
sched_domains, !want_affine wake-ups no longer end up using
select_idle_sibling() anyway, but end up using find_idlest_*().

Thinking more about it, should we force those tasks to use
select_idle_sibling() anyway? Something like the below could do it, I
think:

@@ -5444,7 +5444,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = cpu;
 	}
 
-	if (!sd) {
+	if (!sd || (!wake_cap(p, cpu, prev_cpu) && (sd_flag & SD_BALANCE_WAKE))) {
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

Ideally, cases 5-7 should be handled by find_idlest_*(), which seems to
hold true in the table above, and cases 1-4 and 8 should be handled by
select_idle_sibling(), which isn't always the case due to wake_wide().

Thoughts?
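For reference, since wake_wide() is what gates want_affine here, the
mainline heuristic is roughly the following (from memory, slightly
simplified); it only returns true when the waker/wakee flip counts
suggest a one-to-many wake-up pattern wider than the LLC domain:

	static int wake_wide(struct task_struct *p)
	{
		unsigned int master = current->wakee_flips;
		unsigned int slave = p->wakee_flips;
		int factor = this_cpu_read(sd_llc_size);

		if (master < slave)
			swap(master, slave);

		/*
		 * Stay affine unless both flip counts indicate that the
		 * waker fans out to more distinct wakees than fit in LLC.
		 */
		if (slave < factor || master < slave * factor)
			return 0;
		return 1;
	}

If that heuristic is too trigger-happy for this kind of workload, the
diff above would effectively override it for tasks that fit anywhere,
while still letting the big tasks take the find_idlest_*() route.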