From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Mel Gorman <mgorman@suse.de>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Christopher Lameter <cl@linux.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
Date: Fri, 8 May 2020 18:33:04 +0530
Message-ID: <20200508130304.GA1961@linux.vnet.ibm.com>
In-Reply-To: <20200504093712.GL22838@dhcp22.suse.cz>

* Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:

> > > 
> > > Have you tested on something else than ppc? Each arch does the NUMA
> > > setup separately and this is a big mess. E.g. x86 marks even memory less
> > > nodes (see init_memory_less_node) as online.
> > > 
> > 
> > while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> > enabled/disabled on both single node and multi node machines.
> > However, I dont have a cpuless/memoryless x86 system.
> 
> This should be able to emulate inside kvm, I believe.
> 

I did try, but somehow I was not able to get a cpuless / memoryless node in
an x86 KVM guest.

Also, I am unable to see how to enable HAVE_MEMORYLESS_NODES on an x86 system.
# git grep -w HAVE_MEMORYLESS_NODES | cat
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES
#
I force-enabled it, but it got disabled during the kernel build.
Maybe I am missing something.

> > 
> > So we have a redundant page hinting numa faults which we can avoid.
> 
> interesting. Does this lead to any observable differences? Btw. it would
> be really great to describe how the online state influences the numa
> balancing.
> 

If numa_balancing is enabled, the kernel checks the number of online nodes.
If only one node is online, it disables numa_balancing; otherwise
numa_balancing stays as is. Without the patch, both the actual node
(node nr > 0) and node 0 were marked online, so numa_balancing stayed enabled.
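
The check described above can be modelled roughly as follows. This is a
simplified Python sketch of the described behaviour, not the kernel's code;
the helper name and the node sets (dummy node 0 plus one real node, here 7)
are assumptions made for illustration.

```python
def numa_balancing_stays_enabled(online_nodes, requested=True):
    """Hypothetical model of the check described above.

    NUMA balancing is only kept on when more than one node is online;
    a single online node turns it off.
    """
    if not requested:
        return False
    return len(set(online_nodes)) > 1


# Without the patch: dummy node 0 and the real node are both online.
without_patch = numa_balancing_stays_enabled({0, 7})
# With the patch: only the real node is online.
with_patch = numa_balancing_stays_enabled({7})
print(without_patch, with_patch)
```

With two online nodes the model keeps balancing enabled; with one it does not.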

Here are 2 sample numa programs.

numa01.sh is a set of 2 processes, each running as many threads as there are
cpus; each thread does 50 loops of operations on 3GB of process-shared memory.

numa02.sh is a single process with as many threads as there are cpus;
each thread does 800 loops of operations on 32MB of thread-local memory.

Without the patch:
Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  149.62   149.66   149.64   0.02
./numa01.sh      Sys:   3.21     3.71     3.46     0.25
./numa01.sh      User:  4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:  24.98    25.02    25.00    0.02
./numa02.sh      Sys:   0.51     0.59     0.55     0.04
./numa02.sh      User:  790.28   790.88   790.58   0.30

With the patch:
Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  149.44   149.46   149.45   0.01    0.127133%
./numa01.sh      Sys:   0.71     0.89     0.80     0.09    332.5%
./numa01.sh      User:  4754.19  4754.48  4754.33  0.15    0.0485873%
./numa02.sh      Real:  24.97    24.98    24.98    0.00    0.0800641%
./numa02.sh      Sys:   0.26     0.41     0.33     0.08    66.6667%
./numa02.sh      User:  789.75   790.28   790.01   0.27    0.072151%
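
The %Change column in the second table appears to be computed relative to
the second run's average, i.e. (first - second) / second. That convention is
inferred from the numbers, not stated anywhere above. A short Python check
under that assumption (the function name is made up for illustration):

```python
# Averages copied from the two tables above: (first table, second table).
averages = {
    ("numa01.sh", "Real"): (149.64, 149.45),
    ("numa01.sh", "Sys"): (3.46, 0.80),
    ("numa01.sh", "User"): (4756.64, 4754.33),
    ("numa02.sh", "Real"): (25.00, 24.98),
    ("numa02.sh", "Sys"): (0.55, 0.33),
    ("numa02.sh", "User"): (790.58, 790.01),
}


def pct_change_vs_second(first, second):
    """Assumed convention: difference expressed relative to the second run."""
    return (first - second) / second * 100.0


changes = {key: pct_change_vs_second(a, b) for key, (a, b) in averages.items()}
for (test, kind), value in changes.items():
    print(f"{test} {kind:5s} {value:.4f}%")
```

Because the denominator is the smaller, second value, a drop in Sys time from
3.46 to 0.80 shows up as 332.5% rather than as a reduction of about 77%.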

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        1131164     0           -100%
numa_hint_faults_local  1131164     0           -100%
numa_hit                213696      214244      0.256439%
numa_local              213696      214244      0.256439%
numa_pte_updates        1131294     0           -100%
pgfault                 1380845     241424      -82.5162%
pgmajfault              75          60          -20%

numa02.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        111878      0           -100%
numa_hint_faults_local  111878      0           -100%
numa_hit                41854       43220       3.26373%
numa_local              41854       43220       3.26373%
numa_pte_updates        113926      0           -100%
pgfault                 163662      51210       -68.7099%
pgmajfault              56          52          -7.14286%
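
As a rough cross-check of the counters above, the drop in pgfault can be
compared with the number of NUMA hinting faults that disappear. The sketch
below uses only values from the two tables; the helper names are made up for
illustration. Note that %Change in these tables is
(with_patch - no_patch) / no_patch.

```python
# (no_patch, with_patch) counters copied from the tables above.
counters = {
    "numa01.sh": {"numa_hint_faults": (1131164, 0), "pgfault": (1380845, 241424)},
    "numa02.sh": {"numa_hint_faults": (111878, 0), "pgfault": (163662, 51210)},
}


def pct_change(no_patch, with_patch):
    """Change relative to the unpatched value, as used in the tables."""
    return (with_patch - no_patch) / no_patch * 100.0


def hint_share_of_pgfault_drop(c):
    """Fraction of the pgfault reduction explained by hinting faults."""
    hint_before, hint_after = c["numa_hint_faults"]
    pf_before, pf_after = c["pgfault"]
    return (hint_before - hint_after) / (pf_before - pf_after)


shares = {name: hint_share_of_pgfault_drop(c) for name, c in counters.items()}
for name, share in shares.items():
    print(f"{name}: pgfault {pct_change(*counters[name]['pgfault']):.4f}%, "
          f"hinting faults explain {share:.1%} of the drop")
```

For both workloads the hinting faults account for roughly 99% of the pgfault
reduction.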

Observations:
The real time and user time don't change much. However, the system time
drops noticeably (numa01.sh average: 3.46 to 0.80; numa02.sh average: 0.55
to 0.33). The reason is the number of NUMA hinting faults: with the patch we
no longer see any NUMA hinting faults.

> > 2. Few people have complained about existence of this dummy node when
> > parsing lscpu and numactl o/p. They somehow start to think that the tools
> > are reporting incorrectly or the kernel is not able to recognize resources
> > connected to the node.
> 
> Please be more specific.

Take the below numactl output as an example:
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 7 size: 16238 MB
node 7 free: 15449 MB
node distances:
node   0   7 
  0:  10  20 
  7:  20  10 

We know node 0 can be special, but users may not feel the same.

When users parse numactl/lscpu output or the /sys directory, they find there
are 2 online nodes. They find that one node (node 0) has no resources
available but is still online. However, they find that other nodes
(nodes 1-6) also have no resources but are not online. So they tend to think
the kernel has been unable to online some of the resources, or that the
resources have gone bad. Please do note that on hypervisors like PowerVM, the
admins don't have control over which nodes the resources are allocated from.
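
The confusing case above (a node that is listed but has neither cpus nor
memory) can be spotted mechanically. The following is a minimal Python
sketch that assumes the numactl output format shown in the example; the
function name is hypothetical and the sample text is copied from that
example.

```python
import re

SAMPLE = """\
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 7 size: 16238 MB
node 7 free: 15449 MB
"""


def empty_listed_nodes(numactl_text):
    """Return ids of listed nodes that have no cpus and 0 MB of memory."""
    cpus, size_mb = {}, {}
    for raw in numactl_text.splitlines():
        line = raw.rstrip()
        m = re.match(r"node (\d+) cpus:(.*)$", line)
        if m:
            cpus[int(m.group(1))] = m.group(2).split()
            continue
        m = re.match(r"node (\d+) size: (\d+) MB$", line)
        if m:
            size_mb[int(m.group(1))] = int(m.group(2))
    return sorted(n for n in cpus if not cpus[n] and size_mb.get(n, 0) == 0)


print(empty_listed_nodes(SAMPLE))
```

On the sample above only node 0 is reported, which is the dummy node users
are puzzled by.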

-- 
Thanks and Regards
Srikar Dronamraju
