From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80184C43461 for ; Mon, 7 Sep 2020 12:48:48 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 35102215A4 for ; Mon, 7 Sep 2020 12:48:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729315AbgIGMso (ORCPT ); Mon, 7 Sep 2020 08:48:44 -0400 Received: from outbound-smtp14.blacknight.com ([46.22.139.231]:45739 "EHLO outbound-smtp14.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729247AbgIGMmk (ORCPT ); Mon, 7 Sep 2020 08:42:40 -0400 Received: from mail.blacknight.com (pemlinmail02.blacknight.ie [81.17.254.11]) by outbound-smtp14.blacknight.com (Postfix) with ESMTPS id A8BCA1C3F14 for ; Mon, 7 Sep 2020 13:42:19 +0100 (IST) Received: (qmail 10544 invoked from network); 7 Sep 2020 12:42:19 -0000 Received: from unknown (HELO techsingularity.net) (mgorman@techsingularity.net@[84.203.21.127]) by 81.17.254.9 with ESMTPSA (AES256-SHA encrypted, authenticated); 7 Sep 2020 12:42:19 -0000 Date: Mon, 7 Sep 2020 13:42:16 +0100 From: Mel Gorman To: "Song Bao Hua (Barry Song)" Cc: Mel Gorman , "mingo@redhat.com" , "peterz@infradead.org" , "juri.lelli@redhat.com" , "vincent.guittot@linaro.org" , "dietmar.eggemann@arm.com" , "bsegall@google.com" , "linux-kernel@vger.kernel.org" , Peter Zijlstra , Valentin Schneider , Phil Auld , Hillf Danton , Ingo Molnar , Linuxarm , "Liguozhu (Kenneth)" Subject: Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load? Message-ID: <20200907124216.GE3179@techsingularity.net> References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Sep 07, 2020 at 12:00:10PM +0000, Song Bao Hua (Barry Song) wrote: > Hi All, > In case we have a numa system with 4 nodes and in each node we have 24 cpus, and all of the 96 cores are idle. > Then we start a process with 4 threads in this totally idle system. > Actually any one of the four numa nodes should have enough capability to run the 4 threads while they can still have 20 idle CPUS after that. > But right now the existing code in CFS load balance will spread the 4 threads to multiple nodes. > This results in two negative side effects: > 1. more numa nodes are awaken while they can save power in lowest frequency and halt status > 2. cache coherency overhead between numa nodes > > A proof-of-concept patch I made to "fix" this issue to some extent is like: > This is similar in concept to a patch that did something similar except in adjust_numa_imbalance(). It ended up being great for light loads like simple communicating pairs but fell apart for some HPC workloads when memory bandwidth requirements increased. Ultimately it was dropped until the NUMA/CPU load balancing was reconciled so may be worth a revisit. At the time, it was really problematic once a one node was roughly 25% CPU utilised on a 2-socket machine with hyper-threading enabled. The patch may still work out but it would need wider testing. Within mmtests, the NAS workloads for D-class on a 2-socket machine varying the number of parallel tasks/processes are used should be enough to determine if the patch is free from side-effects for one machine. It gets problematic for different machine sizes as the point where memory bandwidth is saturated varies. group_weight/4 might be fine on one machine as a cut-off and a problem on a larger machine with more cores -- I hit that particular problem when one 2 socket machine with 48 logical CPUs was fine but a different machine with 80 logical CPUs regressed. I'm not saying the patch is wrong, just that patches in general for this area (everyone, not just you) need fairly broad testing. -- Mel Gorman SUSE Labs