Re: [Patch V2 2/2] x86/mm/numa: remove the numa_nodemask_from_meminfo()

From: Wei Yang <richard.weiyang@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Wei Yang <richard.weiyang@gmail.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Tejun Heo <tj@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [Patch V2 2/2] x86/mm/numa: remove the numa_nodemask_from_meminfo()
Date: Tue, 11 Apr 2017 00:39:14 +0800
Message-ID: <20170410163914.GA4404@WeideMacBook-Pro.local>
In-Reply-To: <20170410124320.fq5sw4lt2imztiyl@pd.tnic>

On Mon, Apr 10, 2017 at 02:43:20PM +0200, Borislav Petkov wrote:
>On Sun, Apr 09, 2017 at 11:12:14AM +0800, Wei Yang wrote:
>> Oops, sorry to bring in the regression with my cleanup.
>> I haven't noticed there is a kernel command line "numa=fake", which
>> is the cause of the crash I think.
>
>Of course it is, didn't you see my debugging upthread?
>
>> So from my understanding, I am going to do these tests:
>> 
>> 1. all fake numa scenarios with Kirill's qemu command line
>
>It is enough if you boot the kernel with "numa=fake..."
>
>> 2. Real numa scenarios with following qemu command option
>
>Not qemu command option but a kernel cmdline option.
>
>> 3. Baremetal
>> 
>> One more question, on the baremetal machine, I can't change the
>> numa configuration, so there would be only one case. Do you have
>> some specific requirement?
>
>numa=fake on baremetal too.
>
>> Well, if I missed something, just let me know :-)
>> 
>> > Qemu can emulate real numa too, for example you can boot with:
>> >
>> > -smp 64 \
>> > -numa node,nodeid=0,cpus=1-8 \
>> > -numa node,nodeid=1,cpus=9-16 \
>> > -numa node,nodeid=2,cpus=17-24 \
>> > -numa node,nodeid=3,cpus=25-32 \
>> > -numa node,nodeid=4,cpus=0 \
>> > -numa node,nodeid=4,cpus=33-39 \
>> > -numa node,nodeid=5,cpus=40-47 \
>> > -numa node,nodeid=6,cpus=48-55 \
>> > -numa node,nodeid=7,cpus=56-63
>
>Also, do this in kvm. kvm can emulate a lot of numa configurations, do
>experiment with those too.
>
>Basically, try to break your "cleanup". Stuff one should do for every
>patch one sends anyway.

Hi, Borislav

I have tried several test combinations with fake numa, and the results look
good.

A result marked P (Passed) means the system boots up and a simple kernel
build test succeeds.

# test matrix and result

## Qemu

With qemu, I have tried every combination of phys_node in (1, 4) and
emu_node in (0, 2, 4, 8):

  +----------------+--------+--------+
  |      phys_node |   1    |   4    |
  |emu_node        |        |        |
  +----------------+--------+--------+
  |        0       |   P    |   P    |
  +----------------+--------+--------+
  |        2       |   P    |   P    |
  +----------------+--------+--------+
  |        4       |   P    |   P    |
  +----------------+--------+--------+
  |        8       |   P    |   P    |
  +----------------+--------+--------+

phys_node is emulated with the qemu command line:

    -numa node,nodeid=0,cpus=1-2 \
    -numa node,nodeid=1,cpus=3-4 \
    -numa node,nodeid=2,cpus=0 \
    -numa node,nodeid=2,cpus=5 \
    -numa node,nodeid=3,cpus=6-7
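
To double-check which CPUs end up on which node with the -numa options
above, here is a minimal sketch. The helper name parse_numa_opts() is
hypothetical, and the regular expression only handles the
"node,nodeid=N,cpus=A[-B]" form used in this mail, not the full qemu syntax.

```python
import re
from collections import defaultdict

# The -numa options from this mail, copied as one string.
QEMU_NUMA_OPTS = (
    "-numa node,nodeid=0,cpus=1-2 -numa node,nodeid=1,cpus=3-4 "
    "-numa node,nodeid=2,cpus=0 -numa node,nodeid=2,cpus=5 "
    "-numa node,nodeid=3,cpus=6-7"
)

NUMA_OPT_RE = re.compile(r"-numa node,nodeid=(\d+),cpus=(\d+(?:-\d+)?)")


def parse_numa_opts(opts):
    """Return {nodeid: sorted list of CPU ids} (hypothetical helper)."""
    nodes = defaultdict(set)
    for nodeid, cpus in NUMA_OPT_RE.findall(opts):
        first, _, last = cpus.partition("-")
        last = last or first  # a single CPU such as "cpus=0"
        nodes[int(nodeid)].update(range(int(first), int(last) + 1))
    return {nid: sorted(cpus) for nid, cpus in sorted(nodes.items())}


topology = parse_numa_opts(QEMU_NUMA_OPTS)
for nid, cpus in topology.items():
    print(f"node {nid}: cpus {cpus}")
```

This gives four nodes, with node 2 holding the non-contiguous CPUs 0 and 5,
and every CPU from 0 to 7 assigned exactly once.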

emu_node is emulated with the kernel command line:

    "numa=fake=N"

## Baremetal

My machine has only one numa node, so I could only verify phys_node = 1.

  +----------------+--------+
  |      phys_node |   1    |
  |emu_node        |        |
  +----------------+--------+
  |        0       |   P    |
  +----------------+--------+
  |        2       |   P    |
  +----------------+--------+
  |        4       |   P    |
  +----------------+--------+
  |        8       |   P    |
  +----------------+--------+

emu_node is emulated with the kernel command line:

    "numa=fake=N"

# Other things I observed

Generally, everything looks good in the qemu guest, but I saw two things on
the baremetal machine.

First, I want to emphasize that I saw the same behavior with and without my
"cleanup".

## Only 3 nodes when numa=fake=4

[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000022f5fffff]
[    0.000000] Faking node 0 at [mem 0x0000000000000000-0x000000007fffffff] (2048MB)
[    0.000000] Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
[    0.000000] Faking node 2 at [mem 0x0000000134000000-0x000000022f5fffff] (4022MB)
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x000000007fffffff]
[    0.000000]   node   1: [mem 0x0000000080000000-0x00000000ba5b1fff]
[    0.000000]   node   1: [mem 0x00000000ba5b9000-0x00000000bad8dfff]
[    0.000000]   node   1: [mem 0x00000000bafb6000-0x00000000ca8a1fff]
[    0.000000]   node   1: [mem 0x00000000ca93a000-0x00000000ca977fff]
[    0.000000]   node   1: [mem 0x00000000cafff000-0x00000000caffffff]
[    0.000000]   node   1: [mem 0x0000000100000000-0x0000000133ffffff]
[    0.000000]   node   2: [mem 0x0000000134000000-0x000000022f5fffff]
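
A minimal sketch of how the node count and sizes can be checked from the
"Faking node N" lines of this log. The helper name fake_nodes() and the
variable names are hypothetical; the sizes are computed as plain address
spans, which is an assumption about what the value in parentheses means
(it agrees with the three lines here).

```python
import re

# The three "Faking node N" lines from the log in this mail.
DMESG = """\
[    0.000000] Faking node 0 at [mem 0x0000000000000000-0x000000007fffffff] (2048MB)
[    0.000000] Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
[    0.000000] Faking node 2 at [mem 0x0000000134000000-0x000000022f5fffff] (4022MB)
"""

FAKE_NODE_RE = re.compile(
    r"Faking node (\d+) at \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]"
)


def fake_nodes(log):
    """Return a list of (node id, start, end) tuples (hypothetical helper)."""
    return [
        (int(nid), int(start, 16), int(end, 16))
        for nid, start, end in FAKE_NODE_RE.findall(log)
    ]


requested = 4  # from numa=fake=4
nodes = fake_nodes(DMESG)
# The end address is inclusive, hence the +1 before converting to MB.
spans_mb = [(end - start + 1) >> 20 for _, start, end in nodes]

for (nid, _, _), mb in zip(nodes, spans_mb):
    print(f"node {nid}: address span {mb} MB")
print(f"requested {requested} nodes, created {len(nodes)}")
```

The spans add up to the whole faked range, so no memory is lost; only the
number of nodes differs from the request.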

## Some warnings

I don't see these two warnings without "numa=fake=N".

[    0.004000] sched: CPU #1's llc-sibling CPU #0 is not on the same node!  [node: 1 != 0]. Ignoring dependency.
[    0.004000] ------------[ cut here ]------------
[    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:424 topology_sane.isra.5+0x6c/0x70

[    8.594469] sysfs: cannot create duplicate filename '/devices/platform/coretemp.0/hwmon/hwmon2/temp2_label'
[    8.594478] ------------[ cut here ]------------
[    8.594482] WARNING: CPU: 4 PID: 34 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x56/0x70

# Some thoughts on the code

After going through numa_emulation(), I suggest reconstructing
numa_nodes_parsed from the emulated nodes, instead of setting
numa_nodes_parsed directly in emu_setup_memblk().

Two cases come to mind where the current approach is not friendly:
1. split_nodes_size_interleave()/split_nodes_interleave() may fail, or a
later step may fail.
2. There may be fewer fake nodes than physical nodes.

Either of them may lead to an inaccurate numa_nodes_parsed, so I have a patch
to reconstruct it from the emulated node info.

Will send it soon.
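
As a rough illustration of case 2 (a toy model in Python, not the kernel code
and not the patch itself): MemBlk, set_incrementally() and
rebuild_from_emulated() are hypothetical names, and the model assumes the
mask starts out holding the physical node ids and only ever has bits added.

```python
from dataclasses import dataclass

NUMA_NO_NODE = -1


@dataclass
class MemBlk:
    """Hypothetical stand-in for one emulated memory block."""
    start: int
    end: int
    nid: int


def set_incrementally(parsed, emulated):
    """Model of setting node bits one by one while blocks are created."""
    parsed = set(parsed)  # copy; bits from before emulation stay set
    for blk in emulated:
        parsed.add(blk.nid)
    return parsed


def rebuild_from_emulated(emulated):
    """Model of deriving the mask only from the final emulated blocks."""
    return {blk.nid for blk in emulated if blk.nid != NUMA_NO_NODE}


# Four physical nodes, but only two fake nodes were created.
physical_nodes = {0, 1, 2, 3}
emulated = [
    MemBlk(0x00000000, 0x80000000, 0),
    MemBlk(0x80000000, 0x100000000, 1),
]

incremental = set_incrementally(physical_nodes, emulated)
rebuilt = rebuild_from_emulated(emulated)
print("incremental:", sorted(incremental))
print("rebuilt:    ", sorted(rebuilt))
```

In this model the incremental mask still contains nodes 2 and 3, while the
rebuilt mask only contains the nodes that really exist after emulation.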

>
>-- 
>Regards/Gruss,
>    Boris.
>
>Good mailing practices for 400: avoid top-posting and trim the reply.

-- 
Wei Yang
Help you, Help me
