From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751622AbdLLILc (ORCPT); Tue, 12 Dec 2017 03:11:32 -0500
Received: from mx2.suse.de ([195.135.220.15]:41224 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750731AbdLLIL3 (ORCPT); Tue, 12 Dec 2017 03:11:29 -0500
Date: Tue, 12 Dec 2017 09:11:26 +0100
From: Michal Hocko
To: kemi
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
	Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
	Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
	Linux MM, Linux Kernel
Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
Message-ID: <20171212081126.GK4779@dhcp22.suse.cz>
References: <1511848824-18709-1-git-send-email-kemi.wang@intel.com>
	<20171129121740.f6drkbktc43l5ib6@dhcp22.suse.cz>
	<4b840074-cb5f-3c10-d65b-916bc02fb1ee@intel.com>
	<20171130085322.tyys6xbzzvui7ogz@dhcp22.suse.cz>
	<0f039a89-5500-1bf5-c013-d39ba3bf62bd@intel.com>
	<20171130094523.vvcljyfqjpbloe5e@dhcp22.suse.cz>
	<9cd6cc9f-252a-3c6f-2f1f-e39d4ec0457b@intel.com>
	<20171208084755.GS20234@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.9.1 (2017-09-22)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 12-12-17 10:05:26, kemi wrote:
> 
> 
> On 2017年12月08日 16:47, Michal Hocko wrote:
> > On Fri 08-12-17 16:38:46, kemi wrote:
> >>
> >>
> >> On 2017年11月30日 17:45, Michal Hocko wrote:
> >>> On Thu 30-11-17 17:32:08, kemi wrote:
> >>
> >> After thinking about how to optimize our per-node stats more gracefully,
> >> we could add a u64 vm_numa_stat_diff[] array to struct per_cpu_nodestat.
> >> That way we keep everything in per-CPU counters and only sum them up
> >> when /proc or /sys is read for the NUMA stats.
> >> What's your take on that? Thanks.
> >
> > I would like to see a strong argument why we cannot make it a _standard_
> > node counter.
> >
> 
> All right.
> This issue was first reported and discussed at the 2017 MM summit, in the
> talk "Provoking and fixing memory bottlenecks - focused on the page
> allocator" presented by Jesper:
> 
> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
> (slides 15/16)
> 
> As you know, the page allocator is too slow and has become a bottleneck
> for high-speed networking.
> Jesper also showed some data in that presentation: with a micro-benchmark
> stressing the order-0 fast path (per-CPU pages), a *32%* extra CPU cycle
> cost (143 -> 97 cycles) comes from CONFIG_NUMA.
> 
> When I took a look at this, I reproduced the issue and got a result
> similar to Jesper's. Furthermore, with help from Jesper, the overhead was
> root-caused: it comes from an extra level of function calls such as
> zone_statistics() (*10%*, nearly 1/3 of the NUMA overhead, including
> __inc_numa_state()), policy_zonelist(), get_task_policy(), policy_nodemask(),
> etc. (perf profiling of CPU cycles). zone_statistics() is the biggest
> contributor introduced by CONFIG_NUMA in the fast path where we can still
> do something to optimize the page allocator. Plus, the overhead of
> zone_statistics() increases significantly with more and more CPU cores
> and nodes due to cache bouncing.
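To make the per-CPU scheme quoted above concrete, here is a rough user-space
model of the idea (simplified names, sizes, and layout; not the real kernel
structures): the allocation fast path only bumps a CPU-local diff entry, and
the per-CPU diffs are summed up lazily on the read side, the way a /proc or
/sys read would be.

/*
 * Rough user-space model of the per-CPU NUMA stats idea discussed above.
 * Names and sizes are simplified assumptions, not the kernel's.
 */
#include <stdio.h>

#define NR_CPUS		4
#define NR_NUMA_EVENTS	3	/* e.g. NUMA_HIT, NUMA_MISS, NUMA_OTHER */

enum numa_stat_item { NUMA_HIT, NUMA_MISS, NUMA_OTHER };

struct per_cpu_nodestat {
	/* per-CPU diffs, only folded together on the read side */
	unsigned long long vm_numa_stat_diff[NR_NUMA_EVENTS];
};

static struct per_cpu_nodestat pcp_stats[NR_CPUS];

/* Fast path: touch only this CPU's counters, no shared cache line. */
static inline void inc_numa_state(int cpu, enum numa_stat_item item)
{
	pcp_stats[cpu].vm_numa_stat_diff[item]++;
}

/* Slow path (reader): sum the per-CPU diffs, the way a /proc read would. */
static unsigned long long sum_numa_state(enum numa_stat_item item)
{
	unsigned long long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += pcp_stats[cpu].vm_numa_stat_diff[item];
	return sum;
}

int main(void)
{
	/* Pretend CPU 0 served 5 local allocations and CPU 1 served 2 remote ones. */
	for (int i = 0; i < 5; i++)
		inc_numa_state(0, NUMA_HIT);
	for (int i = 0; i < 2; i++)
		inc_numa_state(1, NUMA_MISS);

	printf("numa_hit=%llu numa_miss=%llu\n",
	       sum_numa_state(NUMA_HIT), sum_numa_state(NUMA_MISS));
	return 0;
}

The point of this layout is that the hot path never writes a shared cache
line; the cross-CPU summation cost is pushed onto the comparatively rare
reader.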
> 
> Therefore, we previously submitted a patch to mitigate the overhead of
> zone_statistics() by reducing the global NUMA counter update frequency
> (enlarging the threshold size, as suggested by Dave Hansen). I would also
> like to have an implementation of a "_standard_" node counter for NUMA
> stats, but I wonder how we can keep the performance gain at the same
> time.

I understand all that. But we do have a way to put all that overhead away
by disabling the stats altogether. I presume that CPU-cycle-sensitive
workloads would simply use that option, because the stats are quite limited
in their usefulness anyway IMHO. So we are back to: do normal workloads
care all that much about having a 3rd way to account for events? I haven't
heard a sound argument for that.
-- 
Michal Hocko
SUSE Labs
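For reference, the threshold-based mitigation referred to in the quoted
paragraph can be pictured with a similarly small model (the names and the
threshold value are assumptions for illustration, not the kernel
implementation): each CPU accumulates events in a small local diff and folds
it into the shared global counter only once the diff crosses a threshold, so
enlarging the threshold makes writes to the shared cache line rarer.

/*
 * Simplified model of the earlier mitigation mentioned in the thread:
 * per-CPU diffs folded into a shared counter only past a threshold.
 * Names and values are assumptions for illustration, not the kernel's.
 */
#include <stdio.h>

#define NR_CPUS	4

static unsigned int numa_stat_threshold = 32766;   /* "enlarged" threshold */
static unsigned long long global_numa_hit;         /* shared, rarely written */
static unsigned int percpu_numa_hit_diff[NR_CPUS]; /* hot, CPU-local */

/* Allocation fast path: usually just a CPU-local increment. */
static void inc_numa_hit(int cpu)
{
	if (++percpu_numa_hit_diff[cpu] >= numa_stat_threshold) {
		global_numa_hit += percpu_numa_hit_diff[cpu];
		percpu_numa_hit_diff[cpu] = 0;
	}
}

int main(void)
{
	unsigned long long exact;

	for (int i = 0; i < 1000000; i++)
		inc_numa_hit(i % NR_CPUS);

	/* A cheap reader sees only the folded global value... */
	exact = global_numa_hit;
	/* ...an exact reader would also have to sum the pending per-CPU diffs. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		exact += percpu_numa_hit_diff[cpu];

	printf("global=%llu exact=%llu\n", global_numa_hit, exact);
	return 0;
}

This is the accuracy-versus-overhead trade-off the thread keeps coming back
to: a larger threshold means fewer shared-cacheline writes in the allocation
fast path, but a reader that looks only at the global counter can lag behind
by up to threshold * nr_cpus events.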