linux-mm.kvack.org archive mirror
From: Sandipan Das <sandipan@linux.ibm.com>
To: Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	khlebnikov@yandex-team.ru, kirill@shutemov.name,
	aneesh.kumar@linux.ibm.com, srikar@linux.vnet.ibm.com
Subject: Re: [PATCH] mm: vmstat: Use zeroed stats for unpopulated zones
Date: Thu, 7 May 2020 14:35:15 +0530	[thread overview]
Message-ID: <c08821ea-aea6-c902-b8e1-0fdf69492528@linux.ibm.com> (raw)
In-Reply-To: <20200507070924.GE6345@dhcp22.suse.cz>


On 07/05/20 12:39 pm, Michal Hocko wrote:
> On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
>> [...]
>>
>> How about something like this:
>>
>> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
>> index aaf1667489f8..08ec2c2bdce3 100644
>> --- a/Documentation/admin-guide/numastat.rst
>> +++ b/Documentation/admin-guide/numastat.rst
>> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>>  
>>  All units are pages. Hugepages have separate counters.
>>  
>> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
>> +allocate memory from the nodes they prefer. If they succeed, numa_hit is
>> +incremented on the preferred node; otherwise, numa_foreign is incremented on
>> +the preferred node and numa_miss on the node where the allocation succeeded.
>> +
>> +Usually the preferred node is the one local to the CPU where the process
>> +executes, but restrictions such as mempolicies can change that, so there are
>> +also two counters based on the CPU's local node. local_node is similar to
>> +numa_hit and is incremented on allocation from a node by a CPU on the same
>> +node. other_node is similar to numa_miss and is incremented on the node
>> +where an allocation succeeds from a CPU on a different node. Note that
>> +there is no counter analogous to numa_foreign.
>> +
>> +In more detail:
>> +
>>  =============== ============================================================
>>  numa_hit	A process wanted to allocate memory from this node,
>>  		and succeeded.
>> @@ -14,11 +29,13 @@ numa_miss	A process wanted to allocate memory from another node,
>>  		but ended up with memory from this node.
>>  
>>  numa_foreign	A process wanted to allocate on this node,
>> -		but ended up with memory from another one.
>> +		but ended up with memory from another node.
>>  
>> -local_node	A process ran on this node and got memory from it.
>> +local_node	A process ran on this node's CPU,
>> +		and got memory from this node.
>>  
>> -other_node	A process ran on this node and got memory from another node.
>> +other_node	A process ran on a different node's CPU
>> +		and got memory from this node.
>>  
>>  interleave_hit 	Interleaving wanted to allocate from this node
>>  		and succeeded.
>> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>>  well right now on machines with a small number of CPUs.
>>  
>> +Note that on systems with memoryless nodes (where a node has CPUs but no
>> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
>> +heavily. In the current kernel implementation, if a process prefers a
>> +memoryless node (i.e. because it is running on one of its local CPUs), the
>> +implementation actually treats one of the nearest nodes with memory as the
>> +preferred node. As a result, such an allocation will not increment the
>> +numa_foreign counter on the memoryless node, and will skew the numa_hit,
>> +numa_miss and numa_foreign statistics of the nearest node.
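
For context on the counter semantics described in the patch above: the updates
happen in zone_statistics() in mm/page_alloc.c. Roughly, and simplifying a bit
(a sketch of the current logic, with the NUMA-stats static key check omitted):

static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
        enum numa_stat_item local_stat = NUMA_LOCAL;

        /* other_node: the allocating CPU sits on a different node than z */
        if (zone_to_nid(z) != numa_node_id())
                local_stat = NUMA_OTHER;

        if (zone_to_nid(z) == zone_to_nid(preferred_zone))
                __inc_numa_state(z, NUMA_HIT);
        else {
                __inc_numa_state(z, NUMA_MISS);
                __inc_numa_state(preferred_zone, NUMA_FOREIGN);
        }
        __inc_numa_state(z, local_stat);
#endif
}

With a memoryless preferred node, preferred_zone already belongs to the nearest
node with memory by the time we get here, which is why numa_foreign never shows
up on the memoryless node itself.
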
> 
> This is certainly an improvement. Thanks! The question of whether we can
> identify where the bogus numbers come from would be interesting as well.
> Maybe those are not worth fixing, but it would be great to at least
> understand them. I have to say that the explanation via boot_pageset is
> not really clear to me.
> 

The documentation update will definitely help. Thanks for that.
I collected some stack traces on a ppc64 guest for calls to zone_statistics()
for zones that are still using the boot_pageset; most of them originate from
kmem_cache_init(), with eventual calls to allocate_slab():

[    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0
[    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120
[    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100
[    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0
[    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580
[    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610
...

In the remaining cases, the sources are ftrace_init() and early_trace_init().
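
For reference, the traces came from a crude debug hack along the following
lines (a hypothetical sketch of the instrumentation, not the exact diff I used):

        /* In zone_statistics(): dump a stack trace whenever the zone is
         * still on the static boot-time pageset installed by zone_pcp_init().
         */
        if (z->pageset == &boot_pageset)
                dump_stack();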

Unless these counts are useful, can we have zone_statistics() avoid
incrementing stats for zones that are still using the boot_pageset?
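
Concretely, I was imagining an early bail-out at the top of zone_statistics(),
something like this untested sketch:

static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
        /*
         * Zones still using the static boot_pageset have not been fully
         * initialized yet, so any counts gathered here are boot-time noise.
         */
        if (z->pageset == &boot_pageset)
                return;

        /* ... existing hit/miss/foreign/local updates as sketched earlier ... */
#endif
}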


- Sandipan



Thread overview: 12+ messages
2020-05-04  7:03 [PATCH] mm: vmstat: Use zeroed stats for unpopulated zones Sandipan Das
2020-05-04 10:26 ` Michal Hocko
2020-05-06 13:33   ` Vlastimil Babka
2020-05-06 14:02     ` Michal Hocko
2020-05-06 15:09       ` Vlastimil Babka
2020-05-06 15:24         ` Michal Hocko
2020-05-06 15:50           ` Vlastimil Babka
2020-05-07  7:09             ` Michal Hocko
2020-05-07  9:05               ` Sandipan Das [this message]
2020-05-07  9:08                 ` Sandipan Das
2020-05-07 11:07                   ` Vlastimil Babka
2020-05-07 11:17                     ` Sandipan Das
