References: <20210420001844.9116-1-song.bao.hua@hisilicon.com> <20210420001844.9116-4-song.bao.hua@hisilicon.com> <80f489f9-8c88-95d8-8241-f0cfd2c2ac66@arm.com>
From: Vincent Guittot
Date: Wed, 28 Apr 2021 15:04:16 +0200
Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks within one LLC
To: "Song Bao Hua (Barry Song)"
Cc: Dietmar Eggemann, tim.c.chen@linux.intel.com, catalin.marinas@arm.com,
  will@kernel.org, rjw@rjwysocki.net, bp@alien8.de, tglx@linutronix.de,
  mingo@redhat.com, lenb@kernel.org, peterz@infradead.org,
  rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
  msys.mizuma@gmail.com, valentin.schneider@arm.com,
  gregkh@linuxfoundation.org, Jonathan Cameron, juri.lelli@redhat.com,
  mark.rutland@arm.com, sudeep.holla@arm.com, aubrey.li@linux.intel.com,
  linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
  linux-acpi@vger.kernel.org, x86@kernel.org, "xuwei (O)", "Zengtao (B)",
  guodong.xu@linaro.org,
  yangyicong, "Liguozhu (Kenneth)", linuxarm@openeuler.org, hpa@zytor.com

On Wed, 28 Apr 2021 at 11:51, Song Bao Hua (Barry Song) wrote:
>
> > -----Original Message-----
> > From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
> > Sent: Tuesday, April 27, 2021 11:36 PM
> > Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> > within one LLC
> >
> > On 20/04/2021 02:18, Barry Song wrote:
> >
> > [...]
> >
> > > @@ -5786,11 +5786,12 @@ static void record_wakee(struct task_struct *p)
> > >   * whatever is irrelevant, spread criteria is apparent partner count exceeds
> > >   * socket size.
> > >   */
> > > -static int wake_wide(struct task_struct *p)
> > > +static int wake_wide(struct task_struct *p, int cluster)
> > >  {
> > >     unsigned int master = current->wakee_flips;
> > >     unsigned int slave = p->wakee_flips;
> > > -   int factor = __this_cpu_read(sd_llc_size);
> > > +   int factor = cluster ? __this_cpu_read(sd_cluster_size) :
> > > +           __this_cpu_read(sd_llc_size);
> >
> > I don't see that the wake_wide() change has any effect here. None of the
> > sched domains has SD_BALANCE_WAKE set so a wakeup (WF_TTWU) can never
> > end up in the slow path.
>
> I am really confused. The whole code has only checked if wake_flags
> has WF_TTWU; it has never checked if the sched domain has the
> SD_BALANCE_WAKE flag.

Look at:

#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */

So when wake_wide() returns false, we use the wake_affine mechanism; but
when it returns true, we fall back to the default mode, which looks for:

    if (tmp->flags & sd_flag)

This means looking for SD_BALANCE_WAKE, which is never set, so sd will
stay NULL and you will end up calling select_idle_sibling() anyway.

>
> static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
>       ...
>
>       if (wake_flags & WF_TTWU) {
>               record_wakee(p);
>
>               if (sched_energy_enabled()) {
>                       new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>                       if (new_cpu >= 0)
>                               return new_cpu;
>                       new_cpu = prev_cpu;
>               }
>
>               want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
>       }
> }
>
> And try_to_wake_up() has always set WF_TTWU:
>
> static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> {
>       cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
>       ...
> }
>
> So the change in wake_wide() will actually affect the value of want_affine.
> And I did also see code enter the slow path during my benchmark.
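The interaction above can be illustrated with a small userspace model of the patched wake_wide() heuristic. This is only a sketch, not the kernel implementation: the per-cpu sd_cluster_size/sd_llc_size reads are replaced by plain int parameters, and the function name is made up for the example.

```c
#include <assert.h>

/*
 * Userspace sketch of the wake_wide() heuristic with the patch's
 * "cluster" parameter. Returns 1 ("wide" wakeup: want_affine becomes
 * false) when the wakee flip counts exceed the chosen factor, which is
 * the cluster size instead of the LLC size when 'cluster' is set.
 */
static int wake_wide_model(unsigned int master, unsigned int slave,
                           int cluster, int sd_cluster_size, int sd_llc_size)
{
        int factor = cluster ? sd_cluster_size : sd_llc_size;

        if (master < slave) {
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }
        /* Below the threshold: want_affine stays true -> wake_affine path. */
        if (slave < factor || master < slave * factor)
                return 0;
        /*
         * Above the threshold: want_affine is false, and the caller then
         * looks for a domain with SD_BALANCE_WAKE -- which, per the
         * discussion above, is never set, so sd stays NULL and
         * select_idle_sibling() runs anyway.
         */
        return 1;
}
```

With the cluster factor (4 on kunpeng920) the same flip counts cross the "wide" threshold much earlier than with the LLC factor (24), which is exactly why fewer flips now push wakeups out of the affine path.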
>
> One issue I mentioned during the linaro open discussion is that
> since I have moved to using the cluster size to decide the value of
> wake_wide(), relatively fewer tasks will make wake_wide() decide to
> go to the slow path; thus, tasks begin to spread to other NUMA nodes,
> even though llc_size might actually be able to contain those tasks.
> So a possible model might be:
>
> static int wake_wide(struct task_struct *p)
> {
>       tasksize < cluster                  : scan cluster
>       tasksize > llc                      : slow path
>       tasksize > cluster && tasksize < llc: scan llc
> }
>
> thoughts?
>
> > Have you seen a diff when running your `lmbench stream` workload in what
> > wake_wide() returns when you use `sd cluster size` instead of `sd llc
> > size` as factor?
> >
> > I guess for you, wakeups are now subdivided into faster (cluster = 4
> > CPUs) and fast (llc = 24 CPUs) via sis(), not into fast (sis()) and slow
> > (find_idlest_cpu()).
> >
> > >
> > >     if (master < slave)
> > >             swap(master, slave);
> >
> > [...]
> >
> > > @@ -6745,6 +6748,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >     int want_affine = 0;
> > >     /* SD_flags and WF_flags share the first nibble */
> > >     int sd_flag = wake_flags & 0xF;
> > > +   /*
> > > +    * if cpu and prev_cpu share LLC, consider cluster sibling rather
> > > +    * than llc. this is typically true while tasks are bound within
> > > +    * one numa
> > > +    */
> > > +   int cluster = sched_cluster_active() && cpus_share_cache(cpu, prev_cpu, 0);
> >
> > So you changed from scanning cluster before LLC to scan either cluster
> > or LLC.
>
> Yes, I have seen two ugly things with scanning the cluster before
> scanning the LLC in select_idle_cpu():
>
> 1. avg_scan_cost is actually measuring the scanning time. But if we
> scan the cluster before scanning the LLC, in the gap between these two
> domains we need a huge bit operation, and this bit operation is not
> a scanning operation at all.
> This makes the scan_cost quite untrustworthy, particularly as "nr" can
> sometimes be < cluster size and sometimes > cluster size.
>
> 2. select_idle_cpu() is actually the last step of wake_affine; before
> that, the wake_affine code has depended entirely on cpus_share_cache()
> to decide the target to scan from. When waker and wakee are already in
> one LLC, by the time we reach select_idle_cpu() the decision has
> already been made, so if we only change select_idle_cpu() we may lose
> the chance to choose the right target to scan from. So it should be
> more sensible to let cpus_share_cache() check the cluster when related
> tasks are already in the same LLC.
>
> > And this is based on whether `this_cpu` and `prev_cpu` are sharing LLC
> > or not. So you only see an effect when running the workload with
> > `numactl -N X ...`.
>
> Ideally, I'd expect this can also positively affect tasks located in
> different LLCs.
> For example, if taskA and taskB are in different NUMA nodes (also
> different LLCs on both kunpeng920 and Tim's hardware) at the very
> beginning, a two-stage packing might let them take advantage of the
> cluster:
> for the first wake-up, taskA and taskB will be put in one LLC by
> scanning the LLC;
> for the second wake-up, they might be put in one cluster by scanning
> the cluster.
> So ideally, scanning LLC and scanning cluster can work in two stages
> for related tasks and pack them step by step. Practically, this could
> happen. But LB between NUMA nodes might take the opposite way.
> Actually, for a kernel completely *without* the cluster patch, I have
> seen some serious ping-pong of tasks between two NUMA nodes due to the
> conflict between wake_affine and LB. This kind of ping-pong can
> seriously affect performance.
> For example, for g=6,12,18,24,28,32, I have found that running the
> same workload on 2 NUMA nodes shows much worse latency than running it
> on a single NUMA node (each NUMA node has 24 cpus).
>
> 1numa command: numactl -N 0   hackbench -p -T -l 1000000 -f 1 -g $1
> 2numa command: numactl -N 0-1 hackbench -p -T -l 1000000 -f 1 -g $1
>
> Measured the average latency of 20 runs for each command.
>
> *) without cluster scheduler, 2numa vs 1numa:
> g       =    6       12      18      24      28      32
> 1numa      1.2474  1.5635  1.5133  1.4796  1.6177  1.7898
> 2numa      4.1997  5.8172  6.0159  7.2343  6.8264  6.5419
>
> BTW, my cluster patch actually also improves 2numa:
> *) with cluster scheduler, 2numa vs 1numa:
> g       =    6       12      18      24      28      32
> 1numa      0.9500  1.0728  1.1756  1.2201  1.4166  1.5464
> 2numa      3.5404  4.3744  4.3988  4.6994  5.3117  5.4908
>
> *) 2numa w/ and w/o cluster:
> g       =    6       12      18      24      28      32
> 2numa w/o  4.1997  5.8172  6.0159  7.2343  6.8264  6.5419
> 2numa w/   3.5404  4.3744  4.3988  4.6994  5.3117  5.4908
>
> Ideally, load balance should try to pull unmarried tasks rather than
> married tasks. I mean, if we have
>   groupA: task1+task2 as a couple, task3 as a bachelor
>   groupB: task4,
> groupB should try to pull task3. But I feel it is extremely hard to
> let LB understand who is married and who is unmarried.
>
> I assume 2numa being worse than 1numa is a different topic which might
> be worth more investigation.
>
> On the other hand, the use cases I'm now targeting really do use
> "numactl -N x" to bind processes to one NUMA node. If we ignore the
> other NUMA nodes (and thus the other LLCs) and treat one NUMA node as
> a whole system, the cluster would be the last-level topology the
> scheduler can use. And the code could be quite clean if we directly
> leverage the existing select_sibling code for the LLC by simply
> changing cpus_share_cache() to the cluster level.
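The cluster-level cpus_share_cache() idea can be sketched with a toy topology model. The 24-CPUs-per-LLC and 4-CPUs-per-cluster numbers come from the kunpeng920 description in this thread; the function and macro names are illustrative, not the kernel's implementation.

```c
#include <assert.h>

/*
 * Toy topology: 24 CPUs per NUMA node (= one LLC on kunpeng920),
 * 4 CPUs per cluster. CPU ids are assumed contiguous per node/cluster,
 * which is an assumption of this sketch, not a kernel guarantee.
 */
#define CPUS_PER_LLC     24
#define CPUS_PER_CLUSTER  4

static int llc_id(int cpu)     { return cpu / CPUS_PER_LLC; }
static int cluster_id(int cpu) { return cpu / CPUS_PER_CLUSTER; }

/*
 * Mirrors the patch's cpus_share_cache(cpu, prev_cpu, cluster) idea:
 * with cluster=1 the check tightens from "same LLC" to "same cluster",
 * so the wake_affine machinery picks a scan target at cluster scope.
 */
static int cpus_share_cache_model(int this_cpu, int that_cpu, int cluster)
{
        if (cluster)
                return cluster_id(this_cpu) == cluster_id(that_cpu);
        return llc_id(this_cpu) == llc_id(that_cpu);
}
```

Under this model, CPUs 0 and 5 share an LLC but not a cluster, so a cluster-level check would steer the scan to the 4-CPU cluster around prev_cpu rather than the full 24-CPU LLC.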
> > >
> > >     if (wake_flags & WF_TTWU) {
> > >             record_wakee(p);
> > > @@ -6756,7 +6765,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                     new_cpu = prev_cpu;
> > >             }
> > >
> > > -           want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > > +           want_affine = !wake_wide(p, cluster) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > >     }
> > >
> > >     rcu_read_lock();
> > > @@ -6768,7 +6777,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >             if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > >                 cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > >                     if (cpu != prev_cpu)
> > > -                           new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
> > > +                           new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, cluster);
> > >
> > >                     sd = NULL; /* Prefer wake_affine over balance flags */
> > >                     break;
> > > @@ -6785,7 +6794,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >             new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
> > >     } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> > >             /* Fast path */
> > > -           new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > > +           new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, cluster);
> > >
> > >             if (want_affine)
> > >                     current->recent_used_cpu = cpu;
> >
> > [...]
>
> Thanks
> Barry

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel