From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 2 Sep 2021 21:39:24 +0800
From: Feng Tang
To: Michal Koutný
Cc: Andi Kleen, Johannes Weiner, Linus Torvalds, andi.kleen@intel.com,
	kernel test robot, Roman Gushchin, Michal Hocko, Shakeel Butt,
	Balbir Singh, Tejun Heo, Andrew Morton, LKML, lkp@lists.01.org,
	"Huang, Ying", Zhengjun Xing
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression
Message-ID: <20210902133924.GA72811@shbuild999.sh.intel.com>
References: <20210818023004.GA17956@shbuild999.sh.intel.com>
	<20210831063036.GA46357@shbuild999.sh.intel.com>
	<20210831092304.GA17119@blackbody.suse.cz>
	<20210901045032.GA21937@shbuild999.sh.intel.com>
	<877dg0wcrr.fsf@linux.intel.com>
	<20210902013558.GA97410@shbuild999.sh.intel.com>
	<20210902034628.GA76472@shbuild999.sh.intel.com>
	<20210902105306.GC17119@blackbody.suse.cz>
In-Reply-To: <20210902105306.GC17119@blackbody.suse.cz>

On Thu, Sep 02, 2021 at 12:53:06PM +0200, Michal Koutný wrote:
> Hi.
>
> On Thu, Sep 02, 2021 at 11:46:28AM +0800, Feng Tang wrote:
> > > Narrowing it down to a single prefetcher seems good enough to me. The
> > > behavior of the prefetchers is fairly complicated and hard to predict, so I
> > > doubt you'll ever get a 100% step-by-step explanation.
>
> My layman explanation, given the available information, is that the
> prefetcher somehow behaves as if it had marked the offending cacheline
> as modified (even though it is only read), thereby slowing down the
> remote reader.

But this can't explain the result that adding 128 bytes of padding
before css->cgroup restores/improves the performance.
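
For reference, that experiment is roughly of the following shape (an
illustrative sketch, not the exact debug patch; the spacer field name
is made up):

	struct cgroup_subsys_state {
		/*
		 * Debug-only spacer: pushes 'cgroup' past the first two
		 * 64-byte cache lines, i.e. past one 128-byte
		 * spatial-prefetcher pair.
		 */
		char __prefetch_spacer[128];

		/* PI: the cgroup that this css is attached to */
		struct cgroup *cgroup;

		/* ... rest of the struct unchanged ... */
	};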
> On Thu, Sep 02, 2021 at 09:35:58AM +0800, Feng Tang wrote:
> > @@ -139,10 +139,21 @@ struct cgroup_subsys_state {
> >  	/* PI: the cgroup that this css is attached to */
> >  	struct cgroup *cgroup;
> >
> > +	struct cgroup_subsys_state *parent;
> > +
> >  	/* PI: the cgroup subsystem that this css is attached to */
> >  	struct cgroup_subsys *ss;
>
> Hm, an interesting move; be mindful of commit b8b1a2e5eca6 ("cgroup:
> move cgroup_subsys_state parent field for cache locality"). It might be
> a regression for systems with a cpuacct root css present. (That is
> likely a large share nowadays, which may be the reason why you don't
> see a full recovery? In the future, we may at least guard
> cpuacct_charge() with a cgroup_subsys_enabled() static branch.)

Good catch!

Actually, I also tested moving only 'destroy_work' and 'destroy_rwork'
(leaving 'parent' untouched, at the cost of 8 more bytes of padding),
which has a similar effect: it restores the throughput to about a 15%
regression.

> > [snip]
> > Yes, I'm afraid so, given that the policy/algorithm used by the
> > prefetcher keeps changing from generation to generation.
>
> Exactly. I'm afraid of re-laying out the structure for each new
> generation. A robust solution is putting all frequently accessed
> members into individual cache lines + separating them with one more
> cache line? :-/

Yes, this is hard. Even for my debug patch, we can only say that it
works in the sense that it partly restores the regression, without
knowing the exact reason.
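
If we did go that route, I imagine something like the sketch below
(purely illustrative and untested; which fields count as "hot" and the
use of ____cacheline_aligned_in_smp plus a guard line are my
assumptions):

	struct cgroup_subsys_state {
		/* hot, read-mostly fields, each on its own cache line */
		struct cgroup *cgroup ____cacheline_aligned_in_smp;
		struct cgroup_subsys_state *parent ____cacheline_aligned_in_smp;

		/*
		 * One full spare line after 'parent', so the
		 * adjacent-line prefetcher cannot pair a hot line with
		 * a cold one.
		 */
		char __guard[SMP_CACHE_BYTES];

		/* colder fields start on their own line after the guard */
		struct cgroup_subsys *ss ____cacheline_aligned_in_smp;

		/* ... rest of the struct unchanged ... */
	};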
Thanks,
Feng

> Michal