On Mon, Apr 10, 2017 at 02:43:20PM +0200, Borislav Petkov wrote:
>On Sun, Apr 09, 2017 at 11:12:14AM +0800, Wei Yang wrote:
>> Oops, sorry to bring in the regression with my cleanup.
>> I haven't noticed there is a kernel command line "numa=fake", which
>> is the cause of the crash I think.
>
>Of course it is, didn't you see my debugging upthread?
>
>> So from my understanding, I am going to do these tests:
>>
>> 1. all fake numa scenarios with Kirill's qemu command line
>
>It is enough if you boot the kernel with "numa=fake..."
>
>> 2. Real numa scenarios with following qemu command option
>
>Not qemu command option but a kernel cmdline option.
>
>> 3. Baremetal
>>
>> One more question, on the baremetal machine, I can't change the
>> numa configuration, so there would be only one case. Do you have
>> some specific requirement?
>
>numa=fake on baremetal too.
>
>> Well, if I missed something, just let me know :-)
>>
>> > Qemu can emulate real numa too, for example you can boot with:
>> >
>> > -smp 64 \
>> > -numa node,nodeid=0,cpus=1-8 \
>> > -numa node,nodeid=1,cpus=9-16 \
>> > -numa node,nodeid=2,cpus=17-24 \
>> > -numa node,nodeid=3,cpus=25-32 \
>> > -numa node,nodeid=4,cpus=0 \
>> > -numa node,nodeid=4,cpus=33-39 \
>> > -numa node,nodeid=5,cpus=40-47 \
>> > -numa node,nodeid=6,cpus=48-55 \
>> > -numa node,nodeid=7,cpus=56-63
>
>Also, do this in kvm. kvm can emulate a lot of numa configurations, do
>experiment with those too.
>
>Basically, try to break your "cleanup". Stuff one should do for every
>patch one sends anyway.

Hi, Borislav

I have tried several test combinations of fake numa, and the results
look good.

A result marked P (Passed) means the system boots up and a simple
kernel build test succeeds.

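For reference, the combinations reported in the matrix below could be
enumerated with a small script. This is only a sketch: build_matrix()
is a hypothetical helper, and it assumes that phys_node = 1 means
booting qemu without any -numa option and that emu_node = 0 means
booting without "numa=fake". The 4-node qemu topology is the one quoted
later in this mail.

```python
from itertools import product

# Values taken from the test matrix in this mail.
PHYS_NODES = (1, 4)
EMU_NODES = (0, 2, 4, 8)

# The 4-node qemu topology quoted in this mail.
QEMU_NUMA_4 = (
    "-numa node,nodeid=0,cpus=1-2 -numa node,nodeid=1,cpus=3-4 "
    "-numa node,nodeid=2,cpus=0 -numa node,nodeid=2,cpus=5 "
    "-numa node,nodeid=3,cpus=6-7"
)


def build_matrix():
    """Return (phys, emu, qemu_numa_args, kernel_arg) for every combination."""
    rows = []
    for phys, emu in product(PHYS_NODES, EMU_NODES):
        # Assumption: one physical node means no -numa option at all.
        qemu_args = QEMU_NUMA_4 if phys == 4 else ""
        # Assumption: emu_node 0 means booting without numa=fake.
        kernel_arg = f"numa=fake={emu}" if emu else ""
        rows.append((phys, emu, qemu_args, kernel_arg))
    return rows


if __name__ == "__main__":
    for phys, emu, qemu_args, kernel_arg in build_matrix():
        print(f"phys={phys} emu={emu} qemu='{qemu_args}' append='{kernel_arg}'")
```
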
# test matrix and result

## Qemu

With qemu, I have tried [phys_node, emu_node] = [(1, 4), (0, 2, 4, 8)]

+----------------+--------+--------+
|      phys_node |    1   |    4   |
|emu_node        |        |        |
+----------------+--------+--------+
|       0        |    P   |    P   |
+----------------+--------+--------+
|       2        |    P   |    P   |
+----------------+--------+--------+
|       4        |    P   |    P   |
+----------------+--------+--------+
|       8        |    P   |    P   |
+----------------+--------+--------+

phys_node is emulated with qemu command line:

    "-numa node,nodeid=0,cpus=1-2 -numa node,nodeid=1,cpus=3-4
     -numa node,nodeid=2,cpus=0 -numa node,nodeid=2,cpus=5
     -numa node,nodeid=3,cpus=6-7"

emu_node is emulated with kernel command line:

    "numa=fake=N"

## Baremetal

My machine has only one numa node, so I could only verify
phys_node = 1.

+----------------+--------+
|      phys_node |    1   |
|emu_node        |        |
+----------------+--------+
|       0        |    P   |
+----------------+--------+
|       2        |    P   |
+----------------+--------+
|       4        |    P   |
+----------------+--------+
|       8        |    P   |
+----------------+--------+

emu_node is emulated with kernel command line:

    "numa=fake=N"

# Other things I observed

Generally, in the qemu guest everything looks good, but I saw two
things on the baremetal machine.

First, I want to emphasize that I saw the same behavior with and
without my "cleanup".

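For the first observation below, the number of nodes the kernel
actually faked can be compared with the requested N by counting the
"Faking node" lines in the boot log. A minimal sketch, assuming the log
format shown in this mail; faked_nodes() and check_fake_count() are
hypothetical names, and SAMPLE is copied from the baremetal log below.

```python
import re

# Matches lines such as:
#   Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
# The summary line "Faking a node at ..." is deliberately not matched.
FAKE_NODE_RE = re.compile(
    r"Faking node (\d+) at \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\] \((\d+)MB\)"
)

SAMPLE = """\
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000022f5fffff]
[    0.000000] Faking node 0 at [mem 0x0000000000000000-0x000000007fffffff] (2048MB)
[    0.000000] Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
[    0.000000] Faking node 2 at [mem 0x0000000134000000-0x000000022f5fffff] (4022MB)
"""


def faked_nodes(log_text):
    """Return a list of (node_id, start, end, size_mb) found in a boot log."""
    return [
        (int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16), int(m.group(4)))
        for m in FAKE_NODE_RE.finditer(log_text)
    ]


def check_fake_count(log_text, requested):
    """Return (matches_request, number_of_faked_nodes)."""
    found = len(faked_nodes(log_text))
    return found == requested, found


if __name__ == "__main__":
    ok, found = check_fake_count(SAMPLE, requested=4)
    print(f"requested=4 found={found} ok={ok}")
```

On a running system the log text could come from the output of dmesg
instead of the SAMPLE string.
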
## only 3 nodes when fake=4

[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000022f5fffff]
[    0.000000] Faking node 0 at [mem 0x0000000000000000-0x000000007fffffff] (2048MB)
[    0.000000] Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
[    0.000000] Faking node 2 at [mem 0x0000000134000000-0x000000022f5fffff] (4022MB)
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x000000007fffffff]
[    0.000000]   node   1: [mem 0x0000000080000000-0x00000000ba5b1fff]
[    0.000000]   node   1: [mem 0x00000000ba5b9000-0x00000000bad8dfff]
[    0.000000]   node   1: [mem 0x00000000bafb6000-0x00000000ca8a1fff]
[    0.000000]   node   1: [mem 0x00000000ca93a000-0x00000000ca977fff]
[    0.000000]   node   1: [mem 0x00000000cafff000-0x00000000caffffff]
[    0.000000]   node   1: [mem 0x0000000100000000-0x0000000133ffffff]
[    0.000000]   node   2: [mem 0x0000000134000000-0x000000022f5fffff]

## some warnings

I don't see these two warnings without "numa=fake=N".

[    0.004000] sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.004000] ------------[ cut here ]------------
[    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:424 topology_sane.isra.5+0x6c/0x70

[    8.594469] sysfs: cannot create duplicate filename '/devices/platform/coretemp.0/hwmon/hwmon2/temp2_label'
[    8.594478] ------------[ cut here ]------------
[    8.594482] WARNING: CPU: 4 PID: 34 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x56/0x70

# Some thoughts on the code

After going through numa_emulation(), I suggest rebuilding
numa_nodes_parsed from the emulated nodes, instead of setting
numa_nodes_parsed directly in emu_setup_memblk().

Two cases come to mind where the current approach is not friendly:

1. split_nodes_size_interleave()/split_nodes_interleave() may fail, or
   the following procedure may fail.
2. there may be fewer fake nodes than physical nodes.

Both of them may lead to an inaccurate numa_nodes_parsed. So I have a
patch that rebuilds it from the emulated node info. Will send it soon.

>
>--
>Regards/Gruss,
>    Boris.
>
>Good mailing practices for 400: avoid top-posting and trim the reply.

-- 
Wei Yang
Help you, Help me