From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>,
Michal Hocko <mhocko@suse.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
David Hildenbrand <david@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>,
Mel Gorman <mgorman@suse.de>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
Christopher Lameter <cl@linux.com>,
linuxppc-dev@lists.ozlabs.org, Vlastimil Babka <vbabka@suse.cz>
Subject: [PATCH v5 0/3] Offline memoryless cpuless node 0
Date: Wed, 24 Jun 2020 14:58:43 +0530
Message-ID: <20200624092846.9194-1-srikar@linux.vnet.ibm.com>
Changelog v4:->v5:
- rebased to v5.8-rc2
Link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u
Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u
Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u
Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#u
A Linux kernel configured with CONFIG_NUMA marks node 0 as online at boot
on any system with multiple possible nodes. In practice, however, some
systems have a node 0 that is both memoryless and cpuless.
This can cause:
1. numa_balancing to be enabled on systems with only one online node.
2. A dummy (cpuless and memoryless) node to exist, which can confuse
users/scripts looking at the output of lscpu / numactl.
This patchset corrects this anomaly.
This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently only 2 architectures, ia64 and powerpc, have this config.
Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.
v5.8-rc2
--------
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10
proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
v5.8-rc2 + patches
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10
proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
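For illustration only (not part of the patches), the dummy node can be
spotted from the sysfs values listed above. This is a minimal sketch: the
helper names parse_nodelist() and dummy_nodes() are hypothetical, and the
input strings are copied from the listings rather than read from a live
system.

```python
def parse_nodelist(text):
    """Parse a sysfs node list such as "0,2" or "0-31" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return nodes


def dummy_nodes(online, has_cpu, has_memory):
    """Return online nodes that have neither CPUs nor memory."""
    return (parse_nodelist(online)
            - parse_nodelist(has_cpu)
            - parse_nodelist(has_memory))


# Values from the v5.8-rc2 listing (before the patches): node 0 is a dummy.
print(sorted(dummy_nodes("0,2", "2", "2")))
# Values from the v5.8-rc2 + patches listing: no dummy node remains.
print(sorted(dummy_nodes("2", "2", "2")))
```

On a real system the three strings would come from
/sys/devices/system/node/online, has_cpu and has_memory respectively.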
1. User-space applications such as numactl and lscpu, which parse sysfs, tend
to believe there is an extra online node. This confuses users and
applications. Other user-space applications conclude that the system was
not able to use all its resources (i.e. resources are missing) or that the
system was not set up correctly.
2. The existence of the dummy node also leads to inconsistent information:
the number of online nodes is inconsistent with the information in the
device-tree and resource-dump.
3. When the dummy node is present, single-node non-NUMA systems end up showing
up as NUMA systems and numa_balancing gets enabled. This means we take
the hit from unnecessary NUMA hinting faults.
On a machine with just one node, whose node number is not 0,
the current setup ends up showing 2 online nodes. And when there is
more than one online node, numa_balancing gets enabled.
Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973 <----------
numa_huge_pte_updates 4654 <----------
numa_hint_faults 19560 <----------
numa_hint_faults_local 19560 <----------
numa_pages_migrated 0
With patch
$ grep numa /proc/vmstat
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0 <----------
numa_huge_pte_updates 0 <----------
numa_hint_faults 0 <----------
numa_hint_faults_local 0 <----------
numa_pages_migrated 0
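As a rough sketch of how the marked counters above could be checked from a
script, the snippet below parses vmstat-style text and extracts the four
NUMA hinting counters. The function names and the HINTING_COUNTERS tuple
are hypothetical; the sample text is copied from the "Without patch"
output, including the trailing arrow markers, which the parser ignores.

```python
HINTING_COUNTERS = (
    "numa_pte_updates",
    "numa_huge_pte_updates",
    "numa_hint_faults",
    "numa_hint_faults_local",
)


def parse_vmstat(text):
    """Map counter name -> value; anything after the value is ignored."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def hinting_activity(text):
    """Return only the counters that NUMA hinting faults drive."""
    counters = parse_vmstat(text)
    return {name: counters.get(name, 0) for name in HINTING_COUNTERS}


without_patch = """\
numa_pte_updates 1206973 <----------
numa_huge_pte_updates 4654 <----------
numa_hint_faults 19560 <----------
numa_hint_faults_local 19560 <----------
numa_pages_migrated 0
"""

activity = hinting_activity(without_patch)
print(activity)
print("hinting faults active:", any(activity.values()))
```

On a live system the text would be the contents of /proc/vmstat; with the
patches applied on the single-node machine, all four values are expected
to stay at 0.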
Here are 2 sample NUMA programs.
numa01.sh is a set of 2 processes, each running as many threads as there
are cpus; each thread does 50 loops of operations on 3GB of process-shared
memory.
numa02.sh is a single process with as many threads as there are cpus;
each thread does 800 loops of operations on 32MB of thread-local memory.
Without patch
-------------
Testcase      Time:       Min      Max      Avg   StdDev
./numa01.sh   Real:    149.62   149.66   149.64     0.02
./numa01.sh   Sys:       3.21     3.71     3.46     0.25
./numa01.sh   User:   4755.13  4758.15  4756.64     1.51
./numa02.sh   Real:     24.98    25.02    25.00     0.02
./numa02.sh   Sys:       0.51     0.59     0.55     0.04
./numa02.sh   User:    790.28   790.88   790.58     0.30
With patch
-----------
Testcase      Time:       Min      Max      Avg   StdDev      %Change
./numa01.sh   Real:    149.44   149.46   149.45     0.01    0.127133%
./numa01.sh   Sys:       0.71     0.89     0.80     0.09       332.5%
./numa01.sh   User:   4754.19  4754.48  4754.33     0.15   0.0485873%
./numa02.sh   Real:     24.97    24.98    24.98     0.00   0.0800641%
./numa02.sh   Sys:       0.26     0.41     0.33     0.08     66.6667%
./numa02.sh   User:    789.75   790.28   790.01     0.27    0.072151%
numa01.sh
param                     no_patch  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults           1131164           0       -100%
numa_hint_faults_local     1131164           0       -100%
numa_hit                    213696      214244   0.256439%
numa_local                  213696      214244   0.256439%
numa_pte_updates           1131294           0       -100%
pgfault                    1380845      241424   -82.5162%
pgmajfault                      75          60        -20%
numa02.sh
param                     no_patch  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults            111878           0       -100%
numa_hint_faults_local      111878           0       -100%
numa_hit                     41854       43220    3.26373%
numa_local                   41854       43220    3.26373%
numa_pte_updates            113926           0       -100%
pgfault                     163662       51210   -68.7099%
pgmajfault                      56          52   -7.14286%
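A note on reading the %Change columns: judging from the figures alone (the
formulas are not stated in this mail, so this is an inference), the timing
tables appear to report the improvement relative to the patched value,
while the counter tables report the change relative to the unpatched
value. A small sketch that reproduces two of the entries under that
assumption, with hypothetical function names:

```python
def timing_change(without_patch, with_patch):
    """Assumed formula for the timing tables: gain relative to the new value."""
    return (without_patch - with_patch) / with_patch * 100.0


def counter_change(no_patch, with_patch):
    """Assumed formula for the counter tables: change relative to the old value."""
    return (with_patch - no_patch) / no_patch * 100.0


# numa01.sh Sys average: 3.46 without the patch, 0.80 with it.
print(f"numa01.sh Sys:     {timing_change(3.46, 0.80):.1f}%")
# numa01.sh pgfault: 1380845 without the patch, 241424 with it.
print(f"numa01.sh pgfault: {counter_change(1380845, 241424):.4f}%")
```

Read this way, the 332.5% entry for numa01.sh Sys time is a reduction in
system time (3.46 down to 0.80), not an increase.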
Observations:
The real time and user time don't change much. However, the system time
drops noticeably (average Sys time goes from 3.46 to 0.80 for numa01.sh and
from 0.55 to 0.33 for numa02.sh). The reason is the number of NUMA hinting
faults: with the patch, we no longer see any NUMA hinting faults.
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Srikar Dronamraju (3):
powerpc/numa: Set numa_node for all possible cpus
powerpc/numa: Prefer node id queried from vphn
mm/page_alloc: Keep memoryless cpuless node 0 offline
arch/powerpc/mm/numa.c | 35 +++++++++++++++++++++++++----------
mm/page_alloc.c | 4 +++-
2 files changed, 28 insertions(+), 11 deletions(-)
--
2.18.1