Message-ID: <7d20a9543f69523cfda280e3f5ab17d68db037ab.camel@intel.com>
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression
From: "ying.huang@intel.com"
To: Aaron Lu
Cc: Mel Gorman, kernel test robot, Linus Torvalds, Vlastimil Babka,
    Dave Hansen, Jesper Dangaard Brouer, Michal Hocko, Andrew Morton,
    LKML, lkp@lists.01.org, lkp@intel.com, feng.tang@intel.com,
    zhengjun.xing@linux.intel.com, fengwei.yin@intel.com
Date: Sat, 07 May 2022 08:54:35 +0800
References: <20220420013526.GB14333@xsang-OptiPlex-9020>

On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> On Fri, May 06, 2022 at 04:40:45PM +0800, ying.huang@intel.com wrote:
> > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > Hi Mel,
> > > 
> > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > 
> > > > (Please note: we reported
> > > > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > > > on
> > > > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > > > while the commit was on a branch. Now we still observe a similar
> > > > regression with the commit on mainline, and we also observe a 13.2%
> > > > improvement on another netperf subtest,
> > > > so we report it again for information.)
> > > > 
> > > > Greetings,
> > > > 
> > > > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> > > > 
> > > > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > 
> > > 
> > > So what this commit does is: if a CPU is always doing frees (pcp->free_factor > 0)
> > 
> > IMHO, this means the consumer and producer are running on different
> > CPUs.
> > 
> 
> Right.
> 
> > > and the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > then do not use the PCP but free the page directly to buddy.
> > > 
> > > The rationale, as explained in the commit's changelog, is:
> > > "
> > > Netperf running on localhost exhibits this pattern and while it does not
> > > matter for some machines, it does matter for others with smaller caches
> > > where cache misses cause problems due to reduced page reuse. Pages
> > > freed directly to the buddy list may be reused quickly while still cache
> > > hot where as storing on the PCP lists may be cold by the time
> > > free_pcppages_bulk() is called.
> > > "
> > > 
> > > This regression occurred on a machine that has large caches, so this
> > > optimization brings it no value, only overhead (the skipped PCP); I
> > > guess that is the reason for the regression.
> > 
> > Per my understanding, not only is the cache size larger, but the L2
> > cache (1MB) is also per-core on this machine. So if the consumer and
> > producer are running on different cores, the cache-hot page may cause
> > more core-to-core cache transfer. This may hurt performance too.
> > 
> 
> The client side allocates the skb (page) and the server side recvfrom()s
> it. recvfrom() copies the page data into the server's own buffer and then
> releases the page associated with the skb. The client does all the
> allocation and the server does all the freeing; page reuse happens on the
> client side. So I think core-to-core cache transfer due to page reuse can
> occur when the client task migrates.

The core-to-core cache transfer can be cross-socket or cross-L2 within one
socket. I mean the latter.

> I have modified the job to bind the client and server to specific CPUs on
> different cores of the same node. Testing it on the same Icelake 2-socket
> server, the result is:
> 
>   kernel        throughput
>   8b10b465d0e1  125168
>   f26b3fa04611  102039  -18%
> 
> It's also an 18% drop. I think this means c2c is not a factor?

Can you test with the client and server bound to the 2 hardware threads
(hyperthreads) of one core? The two hardware threads of one core share the
L2 cache.

> > > I have also tested this case on a small machine, a Skylake desktop, and
> > > there this commit shows an improvement:
> > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > > 
> > > So this means those directly freed pages get reused by the allocation
> > > side, and that brings a performance improvement for machines with
> > > smaller caches.
> > 
> > Per my understanding, the L2 cache on this desktop machine is shared
> > among cores.
> > 
> 
> The CPU in question is an i7-6700 and, according to this Wikipedia page,
> the L2 is per core:
> https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors

Sorry, my memory was wrong. The Skylake and later server parts have a much
larger private L2 cache (1MB vs 256KB on the client parts), which may
increase the possibility of core-to-core transfers.
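To make the condition being discussed concrete, below is a small,
self-contained toy model of the free-path decision as it is described in
this thread. It is only a sketch, not the actual mm/page_alloc.c code: the
function name effective_pcp_high() and the example high/batch values are
made up for illustration, while PAGE_ALLOC_COSTLY_ORDER, pcp->free_factor,
pcp->high and pcp->batch refer to the kernel fields discussed above.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3       /* same value as the kernel's */

/* Minimal stand-in for the struct per_cpu_pages fields used here. */
struct per_cpu_pages {
        int high;               /* PCP high watermark */
        int batch;              /* PCP batch size */
        int free_factor;        /* non-zero when this CPU has mostly been freeing */
};

/*
 * After commit f26b3fa046, as described above: when the CPU keeps freeing
 * (free_factor > 0) and the page being freed has a non-zero order up to
 * PAGE_ALLOC_COSTLY_ORDER, the effective PCP "high" becomes 0, so such
 * pages are flushed straight to the buddy allocator instead of being
 * cached on the PCP list.
 */
static int effective_pcp_high(const struct per_cpu_pages *pcp,
                              unsigned int order)
{
        bool free_high = pcp->free_factor &&
                         order && order <= PAGE_ALLOC_COSTLY_ORDER;

        if (free_high)
                return 0;       /* bypass the PCP list, go straight to buddy */

        return pcp->high;       /* the pre-commit behaviour for every order */
}

int main(void)
{
        /* Arbitrary example values, not measured on any machine. */
        struct per_cpu_pages pcp = { .high = 186, .batch = 63, .free_factor = 1 };
        unsigned int order;

        for (order = 0; order <= PAGE_ALLOC_COSTLY_ORDER + 1; order++)
                printf("order %u -> effective PCP high %d\n",
                       order, effective_pcp_high(&pcp, order));
        return 0;
}

Under this logic, order-0 frees keep using the PCP list even when
free_factor is set; only the "cheap" high orders (1..PAGE_ALLOC_COSTLY_ORDER)
are diverted straight to buddy, which is the behaviour whose cost and benefit
are being debated here.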
> > > I wonder if we should still use the PCP a little bit under the
> > > above-said condition, for the purpose of:
> > > 1. reduced overhead in the free path for machines with large caches;
> > > 2. still keeping the benefit of reused pages for machines with smaller
> > >    caches.
> > > 
> > > For this reason, I tested changing nr_pcp_high() from returning 0 to
> > > returning either pcp->batch or (pcp->batch << 2):
> > > 
> > > machine \ nr_pcp_high() returns:   pcp->high       0   pcp->batch   (pcp->batch << 2)
> > > skylake desktop:                       72288   90784        92219               91528
> > > icelake 2 sockets:                    120956   99177        98251              116108
> > > 
> > > Note: nr_pcp_high() returning pcp->high is the behaviour of this
> > > commit's parent; returning 0 is the behaviour of this commit.
> > > 
> > > The result shows that if we effectively use a PCP high of
> > > (pcp->batch << 2) for the described condition, then this workload's
> > > performance on the small machine is preserved while the regression on
> > > the large machine is greatly reduced (from -18% to -4%).
> > > 
> > 
> > Can we use cache size and topology information directly?
> 
> It can be complicated by the fact that the system can have multiple
> producers (CPUs that are doing the freeing) running at the same time, and
> getting the perfect number can be a difficult job.

We can discuss this after verifying whether it is core-to-core transfer
related.

Best Regards,
Huang, Ying
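For reference, the tweak measured in the table above amounts to something
like the toy model below. Again this is a sketch rather than the actual
mm/page_alloc.c change: the function name and the min() clamp against
pcp->high are illustrative assumptions; the thread only says that the
return value of 0 was replaced by pcp->batch or (pcp->batch << 2).

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3       /* same value as the kernel's */

struct per_cpu_pages {
        int high;               /* PCP high watermark */
        int batch;              /* PCP batch size */
        int free_factor;        /* non-zero when this CPU has mostly been freeing */
};

static int min_int(int a, int b)
{
        return a < b ? a : b;
}

/*
 * Variant measured above: instead of dropping the effective PCP high to 0
 * when the CPU is mostly freeing cheap high-order pages, cap it at a small
 * multiple of the batch size so a few cache-hot pages can still be reused
 * from the PCP list.
 */
static int effective_pcp_high_tweaked(const struct per_cpu_pages *pcp,
                                      unsigned int order)
{
        bool free_high = pcp->free_factor &&
                         order && order <= PAGE_ALLOC_COSTLY_ORDER;

        if (free_high)
                /*
                 * 0 in f26b3fa046; (pcp->batch << 2) recovered most of the
                 * Icelake regression while keeping the Skylake desktop gain,
                 * per the table above.
                 */
                return min_int(pcp->batch << 2, pcp->high);

        return pcp->high;
}

int main(void)
{
        /* Arbitrary example values, not measured on any machine. */
        struct per_cpu_pages pcp = { .high = 512, .batch = 63, .free_factor = 1 };

        /* An order-2 free on a "mostly freeing" CPU: capped high instead of 0. */
        printf("effective PCP high for an order-2 free: %d\n",
               effective_pcp_high_tweaked(&pcp, 2));
        return 0;
}

Whether such a cap should instead be derived from the cache size and
topology information is exactly the open question at the end of this thread.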