Hello,

Thanks to the comments and suggestions on past versions, I am posting
version 6 of the vNUMA patches.
Wei, after sending I realized I did not add you to cc, my apologies.

Please send your comments and suggestions.

Thank you,
Elena

On Fri, Jul 18, 2014 at 1:49 AM, Elena Ufimtseva wrote:
> vNUMA introduction
>
> This series of patches introduces vNUMA topology awareness and
> provides interfaces and data structures to enable vNUMA for
> PV guests. There is a plan to extend this support to dom0 and
> HVM domains.
>
> vNUMA topology must also be supported by the PV guest kernel, so the
> corresponding kernel patches should be applied.
>
> Introduction
> -------------
>
> vNUMA topology is exposed to the PV guest to improve performance when
> running workloads on NUMA machines. vNUMA-enabled guests may also run on
> non-NUMA machines and still have a virtual NUMA topology visible to them.
> The Xen vNUMA implementation provides a way to run vNUMA-enabled guests on
> NUMA/UMA machines and to flexibly map the vNUMA topology to the physical
> NUMA topology.
>
> Mapping to the physical NUMA topology may be done manually or
> automatically. By default, every PV domain has one vNUMA node. It is
> populated with default parameters and does not affect performance. To
> initialize the vNUMA topology automatically, the configuration file only
> needs to define the number of vNUMA nodes. vNUMA topology parameters that
> are not defined will be initialized to their default values.
>
> vNUMA topology is currently defined as a set of parameters:
>     number of vNUMA nodes;
>     distance table;
>     vnode memory sizes;
>     vcpu to vnode mapping;
>     vnode to pnode map (for NUMA machines).
>
> This set of patches introduces two hypercall subops:
> XEN_DOMCTL_setvnumainfo and XENMEM_get_vnuma_info.
>
> XEN_DOMCTL_setvnumainfo is used by the toolstack to populate the domain
> vNUMA topology with the user-defined configuration or the default
> parameters.
> vNUMA is defined for every PV domain. If no vNUMA configuration is found,
> one vNUMA node is initialized and all cpus are assigned to it. All other
> parameters are set to their default values.
>
> XENMEM_getvnumainfo is used by the PV domain to get the information
> about the vNUMA topology from the hypervisor. The guest passes the sizes
> of the memory it has allocated for the different vNUMA parameters, and the
> hypervisor fills that memory with the topology.
> Future work is required in the toolstack and in the hypervisor to allow
> HVM guests to use these hypercalls.
>
> libxl
>
> libxl allows us to define the vNUMA topology in the configuration file and
> verifies that the configuration is correct. libxl also verifies the mapping
> of vnodes to pnodes and uses it on a NUMA machine if automatic placement
> is disabled. In case of an incorrect or insufficient configuration, one
> vNUMA node will be initialized and populated with default values.
>
> libxc
>
> libxc builds the vnode memory addresses for the guest and aligns them as
> necessary. It also takes into account the guest e820 memory map
> configuration. The domain memory is allocated, and the vnode to pnode
> mapping is used to determine the target node for a particular vnode. If
> this mapping is not defined, the machine is not a NUMA machine, or
> automatic NUMA placement is enabled, the default allocation, which is not
> node-specific, will be used.
>
> hypervisor vNUMA initialization
>
> PV guest
>
> As of now, only a PV guest can take advantage of vNUMA functionality.
> Such a guest allocates the memory for the NUMA topology and sets the number
> of nodes and cpus, so the hypervisor knows how much memory the guest has
> preallocated for the vNUMA topology. The guest then makes the subop
> hypercall XENMEM_getvnumainfo.
> If for some reason the vNUMA topology cannot be initialized, the Linux
> guest will have only one NUMA node initialized (standard Linux behavior).
> To enable this, the vNUMA Linux patches and the vNUMA supporting patches
> should be applied to the PV kernel.
>
> Linux kernel patch is available here:
> https://git.gitorious.org/vnuma/linux_vnuma.git
> git://gitorious.org/vnuma/linux_vnuma.git
>
> Automatic vNUMA placement
>
> vNUMA automatic placement will be used if NUMA automatic placement is
> enabled or, if it is disabled, if the vnode to pnode mapping is incorrect.
> If the vnode to pnode mapping is correct and automatic NUMA placement is
> disabled, vNUMA nodes will be allocated on the physical nodes specified in
> the guest config file.
>
> Xen patchset is available here:
> https://git.gitorious.org/vnuma/xen_vnuma.git
> git://gitorious.org/vnuma/xen_vnuma.git
>
>
> Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
>
> memory = 4000
> vcpus = 2
> # The name of the domain, change this if you want more than 1 VM.
> name = "null"
> vnodes = 2
> #vnumamem = [3000, 1000]
> #vnumamem = [4000,0]
> vdistance = [10, 20]
> vnuma_vcpumap = [1, 0]
> vnuma_vnodemap = [1]
> vnuma_autoplacement = 0
> #e820_host = 1
>
> [ 0.000000] Linux version 3.15.0-rc8+ (assert@superpipe) (gcc version 4.7.2 (Debian 4.7.2-5) ) #43 SMP Fri Jun 27 01:23:11 EDT 2014
> [ 0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
> [ 0.000000] ACPI in unprivileged domain disabled
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
> [ 0.000000] bootconsole [xenboot0] enabled
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] DMI not present or invalid.
> [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved > [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable > [ 0.000000] No AGP bridge found > [ 0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000 > [ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size > 24576 > [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] > [ 0.000000] [mem 0x00000000-0x000fffff] page 4k > [ 0.000000] init_memory_mapping: [mem 0xf9e00000-0xf9ffffff] > [ 0.000000] [mem 0xf9e00000-0xf9ffffff] page 4k > [ 0.000000] BRK [0x019c8000, 0x019c8fff] PGTABLE > [ 0.000000] BRK [0x019c9000, 0x019c9fff] PGTABLE > [ 0.000000] init_memory_mapping: [mem 0xf8000000-0xf9dfffff] > [ 0.000000] [mem 0xf8000000-0xf9dfffff] page 4k > [ 0.000000] BRK [0x019ca000, 0x019cafff] PGTABLE > [ 0.000000] BRK [0x019cb000, 0x019cbfff] PGTABLE > [ 0.000000] BRK [0x019cc000, 0x019ccfff] PGTABLE > [ 0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE > [ 0.000000] init_memory_mapping: [mem 0x80000000-0xf7ffffff] > [ 0.000000] [mem 0x80000000-0xf7ffffff] page 4k > [ 0.000000] init_memory_mapping: [mem 0x00100000-0x7fffffff] > [ 0.000000] [mem 0x00100000-0x7fffffff] page 4k > [ 0.000000] RAMDISK: [mem 0x01dd8000-0x035c5fff] > [ 0.000000] Nodes received = 2 > [ 0.000000] NUMA: Initialized distance table, cnt=2 > [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7cffffff] > [ 0.000000] NODE_DATA [mem 0x7cfd9000-0x7cffffff] > [ 0.000000] Initmem setup node 1 [mem 0x7d000000-0xf9ffffff] > [ 0.000000] NODE_DATA [mem 0xf9828000-0xf984efff] > [ 0.000000] Zone ranges: > [ 0.000000] DMA [mem 0x00001000-0x00ffffff] > [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] > [ 0.000000] Normal empty > [ 0.000000] Movable zone start for each node > [ 0.000000] Early memory node ranges > [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] > [ 0.000000] node 0: [mem 0x00100000-0x7cffffff] > [ 0.000000] node 1: [mem 0x7d000000-0xf9ffffff] > [ 0.000000] On node 0 totalpages: 511903 > [ 0.000000] DMA 
zone: 64 pages used for memmap > [ 0.000000] DMA zone: 21 pages reserved > [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 > [ 0.000000] DMA32 zone: 7936 pages used for memmap > [ 0.000000] DMA32 zone: 507904 pages, LIFO batch:31 > [ 0.000000] On node 1 totalpages: 512000 > [ 0.000000] DMA32 zone: 8000 pages used for memmap > [ 0.000000] DMA32 zone: 512000 pages, LIFO batch:31 > [ 0.000000] SFI: Simple Firmware Interface v0.81 > http://simplefirmware.org > [ 0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs > [ 0.000000] nr_irqs_gsi: 16 > [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] > [ 0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices > [ 0.000000] Booting paravirtualized kernel on Xen > [ 0.000000] Xen version: 4.5-unstable (preserve-AD) > [ 0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:2 > nr_node_ids:2 > [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007ac00000 s85888 > r8192 d20608 u2097152 > [ 0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152 > [ 0.000000] pcpu-alloc: [0] 0 [1] 1 > [ 0.000000] xen: PV spinlocks enabled > [ 0.000000] Built 2 zonelists in Node order, mobility grouping on. > Total pages: 1007882 > [ 0.000000] Policy zone: DMA32 > [ 0.000000] Kernel command line: root=/dev/xvda1 ro earlyprintk=xen > debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all > LOGLEVEL=8 earlyprintk=xen sched_debug > [ 0.000000] Memory: 3978224K/4095612K available (4022K kernel code, > 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 117388K reserved) > [ 0.000000] Enabling automatic NUMA balancing. Configure with > numa_balancing= or the kernel.numa_balancing sysctl > [ 0.000000] installing Xen timer for CPU 0 > [ 0.000000] tsc: Detected 2394.276 MHz processor > [ 0.004000] Calibrating delay loop (skipped), value calculated using > timer frequency.. 
4788.55 BogoMIPS (lpj=9577104)
> [ 0.004000] pid_max: default: 32768 minimum: 301
> [ 0.004179] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
> [ 0.006782] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [ 0.007216] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [ 0.007288] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [ 0.007935] CPU: Physical Processor ID: 0
> [ 0.007942] CPU: Processor Core ID: 0
> [ 0.007951] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
> [ 0.007951] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
> [ 0.007951] tlb_flushall_shift: 6
> [ 0.021249] cpu 0 spinlock event irq 17
> [ 0.021292] Performance Events: unsupported p6 CPU model 45 no PMU driver, software events only.
> [ 0.022162] NMI watchdog: disabled (cpu0): hardware events not enabled
> [ 0.022625] installing Xen timer for CPU 1
>
> root@heatpipe:~# numactl --ha
> available: 2 nodes (0-1)
> node 0 cpus: 0
> node 0 size: 1933 MB
> node 0 free: 1894 MB
> node 1 cpus: 1
> node 1 size: 1951 MB
> node 1 free: 1926 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> root@heatpipe:~# numastat
>                 node0   node1
> numa_hit        52257   92679
> numa_miss           0       0
> numa_foreign        0       0
> interleave_hit   4254    4238
> local_node      52150   87364
> other_node        107    5315
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 7 (total: 1024000):
> (XEN) Node 0: 1024000
> (XEN) Node 1: 0
> (XEN) Domain has 2 vnodes, 2 vcpus
> (XEN) vnode 0 - pnode 0, 2000 MB, vcpu nums: 0
> (XEN) vnode 1 - pnode 0, 2000 MB, vcpu nums: 1
>
>
> memory = 4000
> vcpus = 8
> # The name of the domain, change this if you want more than 1 VM.
> name = "null1" > vnodes = 8 > #vnumamem = [3000, 1000] > vdistance = [10, 40] > #vnuma_vcpumap = [1, 0, 3, 2] > vnuma_vnodemap = [1, 0, 1, 1, 0, 0, 1, 1] > vnuma_autoplacement = 1 > e820_host = 1 > > [ 0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed > [ 0.000000] 1-1 mapping on ac228->100000 > [ 0.000000] Released 318936 pages of unused memory > [ 0.000000] Set 343512 page(s) to 1-1 mapping > [ 0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added > [ 0.000000] e820: BIOS-provided physical RAM map: > [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable > [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved > [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable > [ 0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved > [ 0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable > [ 0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved > [ 0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable > [ 0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved > [ 0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable > [ 0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved > [ 0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable > [ 0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved > [ 0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b5fff] unusable > [ 0.000000] Xen: [mem 0x00000000ac6b6000-0x00000000ac7fafff] ACPI NVS > [ 0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable > [ 0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data > [ 0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable > [ 0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data > [ 0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable > [ 0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved > [ 0.000000] Xen: [mem 
0x00000000fed20000-0x00000000fed3ffff] reserved > [ 0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved > [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved > [ 0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved > [ 0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable > [ 0.000000] NX (Execute Disable) protection: active > [ 0.000000] DMI not present or invalid. > [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved > [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable > [ 0.000000] No AGP bridge found > [ 0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000 > [ 0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000 > [ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size > 24576 > [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] > [ 0.000000] [mem 0x00000000-0x000fffff] page 4k > [ 0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff] > [ 0.000000] [mem 0x14da00000-0x14dbfffff] page 4k > [ 0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE > [ 0.000000] BRK [0x019ce000, 0x019cefff] PGTABLE > [ 0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff] > [ 0.000000] [mem 0x14c000000-0x14d9fffff] page 4k > [ 0.000000] BRK [0x019cf000, 0x019cffff] PGTABLE > [ 0.000000] BRK [0x019d0000, 0x019d0fff] PGTABLE > [ 0.000000] BRK [0x019d1000, 0x019d1fff] PGTABLE > [ 0.000000] BRK [0x019d2000, 0x019d2fff] PGTABLE > [ 0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff] > [ 0.000000] [mem 0x100000000-0x14bffffff] page 4k > [ 0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff] > [ 0.000000] [mem 0x00100000-0xac227fff] page 4k > [ 0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff] > [ 0.000000] [mem 0x14dc00000-0x14ddd7fff] page 4k > [ 0.000000] RAMDISK: [mem 0x01dd8000-0x0347ffff] > [ 0.000000] Nodes received = 8 > [ 0.000000] NUMA: Initialized distance table, cnt=8 > [ 0.000000] Initmem setup node 0 [mem 
0x00000000-0x1f3fffff] > [ 0.000000] NODE_DATA [mem 0x1f3d9000-0x1f3fffff] > [ 0.000000] Initmem setup node 1 [mem 0x1f800000-0x3e7fffff] > [ 0.000000] NODE_DATA [mem 0x3e7d9000-0x3e7fffff] > [ 0.000000] Initmem setup node 2 [mem 0x3e800000-0x5dbfffff] > [ 0.000000] NODE_DATA [mem 0x5dbd9000-0x5dbfffff] > [ 0.000000] Initmem setup node 3 [mem 0x5e000000-0x7cffffff] > [ 0.000000] NODE_DATA [mem 0x7cfd9000-0x7cffffff] > [ 0.000000] Initmem setup node 4 [mem 0x7d000000-0x9c3fffff] > [ 0.000000] NODE_DATA [mem 0x9c3d9000-0x9c3fffff] > [ 0.000000] Initmem setup node 5 [mem 0x9c800000-0x10f5d7fff] > [ 0.000000] NODE_DATA [mem 0x10f5b1000-0x10f5d7fff] > [ 0.000000] Initmem setup node 6 [mem 0x10f800000-0x12e9d7fff] > [ 0.000000] NODE_DATA [mem 0x12e9b1000-0x12e9d7fff] > [ 0.000000] Initmem setup node 7 [mem 0x12f000000-0x14ddd7fff] > [ 0.000000] NODE_DATA [mem 0x14ddad000-0x14ddd3fff] > [ 0.000000] Zone ranges: > [ 0.000000] DMA [mem 0x00001000-0x00ffffff] > [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] > [ 0.000000] Normal [mem 0x100000000-0x14ddd7fff] > [ 0.000000] Movable zone start for each node > [ 0.000000] Early memory node ranges > [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] > [ 0.000000] node 0: [mem 0x00100000-0x1f3fffff] > [ 0.000000] node 1: [mem 0x1f400000-0x3e7fffff] > [ 0.000000] node 2: [mem 0x3e800000-0x5dbfffff] > [ 0.000000] node 3: [mem 0x5dc00000-0x7cffffff] > [ 0.000000] node 4: [mem 0x7d000000-0x9c3fffff] > [ 0.000000] node 5: [mem 0x9c400000-0xac227fff] > [ 0.000000] node 5: [mem 0x100000000-0x10f5d7fff] > [ 0.000000] node 6: [mem 0x10f5d8000-0x12e9d7fff] > [ 0.000000] node 7: [mem 0x12e9d8000-0x14ddd7fff] > [ 0.000000] On node 0 totalpages: 127903 > [ 0.000000] DMA zone: 64 pages used for memmap > [ 0.000000] DMA zone: 21 pages reserved > [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 > [ 0.000000] DMA32 zone: 1936 pages used for memmap > [ 0.000000] DMA32 zone: 123904 pages, LIFO batch:31 > [ 0.000000] On node 1 totalpages: 128000 > [ 
0.000000] DMA32 zone: 2000 pages used for memmap > [ 0.000000] DMA32 zone: 128000 pages, LIFO batch:31 > [ 0.000000] On node 2 totalpages: 128000 > [ 0.000000] DMA32 zone: 2000 pages used for memmap > [ 0.000000] DMA32 zone: 128000 pages, LIFO batch:31 > [ 0.000000] On node 3 totalpages: 128000 > [ 0.000000] DMA32 zone: 2000 pages used for memmap > [ 0.000000] DMA32 zone: 128000 pages, LIFO batch:31 > [ 0.000000] On node 4 totalpages: 128000 > [ 0.000000] DMA32 zone: 2000 pages used for memmap > [ 0.000000] DMA32 zone: 128000 pages, LIFO batch:31 > [ 0.000000] On node 5 totalpages: 128000 > [ 0.000000] DMA32 zone: 1017 pages used for memmap > [ 0.000000] DMA32 zone: 65064 pages, LIFO batch:15 > [ 0.000000] Normal zone: 984 pages used for memmap > [ 0.000000] Normal zone: 62936 pages, LIFO batch:15 > [ 0.000000] On node 6 totalpages: 128000 > [ 0.000000] Normal zone: 2000 pages used for memmap > [ 0.000000] Normal zone: 128000 pages, LIFO batch:31 > [ 0.000000] On node 7 totalpages: 128000 > [ 0.000000] Normal zone: 2000 pages used for memmap > [ 0.000000] Normal zone: 128000 pages, LIFO batch:31 > [ 0.000000] SFI: Simple Firmware Interface v0.81 > http://simplefirmware.org > [ 0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs > [ 0.000000] nr_irqs_gsi: 16 > [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac228000-0xac26bfff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac26c000-0xac57ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac580000-0xac5a0fff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac5a1000-0xac5bbfff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac5bc000-0xac5bdfff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac5be000-0xac5befff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac5bf000-0xac5cafff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac5cb000-0xac5d9fff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac5da000-0xac5fafff] > 
[ 0.000000] PM: Registered nosave memory: [mem 0xac5fb000-0xac6b5fff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac6b6000-0xac7fafff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac7fb000-0xac80efff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac80f000-0xac80ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac810000-0xac810fff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac811000-0xac812fff] > [ 0.000000] PM: Registered nosave memory: [mem 0xac813000-0xad7fffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xad800000-0xafffffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xb0000000-0xb3ffffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xb4000000-0xfed1ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xfed20000-0xfed3ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xfed40000-0xfed4ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xfed50000-0xfed8ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xfed90000-0xfedfffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xfee00000-0xfeefffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xfef00000-0xff9fffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xffa00000-0xffa3ffff] > [ 0.000000] PM: Registered nosave memory: [mem 0xffa40000-0xffffffff] > [ 0.000000] e820: [mem 0xb4000000-0xfed1ffff] available for PCI devices > [ 0.000000] Booting paravirtualized kernel on Xen > [ 0.000000] Xen version: 4.5-unstable (preserve-AD) > [ 0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:8 > nr_node_ids:8 > [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88001e800000 s85888 > r8192 d20608 u2097152 > [ 0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152 > [ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 [4] 4 [5] 5 [6] 6 [7] 7 > [ 0.000000] xen: PV spinlocks enabled > [ 0.000000] Built 8 zonelists in Node order, mobility grouping on. 
> Total pages: 1007881
> [ 0.000000] Policy zone: Normal
> [ 0.000000] Kernel command line: root=/dev/xvda1 ro console=hvc0 debug kgdboc=hvc0 nokgdbroundup initcall_debug debug
> [ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [ 0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340
> [ 0.000000] Checking aperture...
> [ 0.000000] No AGP bridge found
> [ 0.000000] Memory: 3976748K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 118864K reserved)
>
> root@heatpipe:~# numactl --ha
> maxn: 7
> available: 8 nodes (0-7)
> node 0 cpus: 0
> node 0 size: 458 MB
> node 0 free: 424 MB
> node 1 cpus: 1
> node 1 size: 491 MB
> node 1 free: 481 MB
> node 2 cpus: 2
> node 2 size: 491 MB
> node 2 free: 482 MB
> node 3 cpus: 3
> node 3 size: 491 MB
> node 3 free: 485 MB
> node 4 cpus: 4
> node 4 size: 491 MB
> node 4 free: 485 MB
> node 5 cpus: 5
> node 5 size: 491 MB
> node 5 free: 484 MB
> node 6 cpus: 6
> node 6 size: 491 MB
> node 6 free: 486 MB
> node 7 cpus: 7
> node 7 size: 476 MB
> node 7 free: 471 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  40  40  40  40  40  40  40
>   1:  40  10  40  40  40  40  40  40
>   2:  40  40  10  40  40  40  40  40
>   3:  40  40  40  10  40  40  40  40
>   4:  40  40  40  40  10  40  40  40
>   5:  40  40  40  40  40  10  40  40
>   6:  40  40  40  40  40  40  10  40
>   7:  40  40  40  40  40  40  40  10
>
> root@heatpipe:~# numastat
>                  node0    node1    node2    node3
> numa_hit        182203    14574    23800    17017
> numa_miss            0        0        0        0
> numa_foreign         0        0        0        0
> interleave_hit    1016     1010     1051     1030
> local_node      180995    12906    23272    15338
> other_node        1208     1668      528     1679
>
>                  node4    node5    node6    node7
> numa_hit         10621    15346     3529     3863
> numa_miss            0        0        0        0
> numa_foreign         0        0        0        0
> interleave_hit    1026     1017     1031     1029
> local_node        8941    13680     1855     2184
> other_node        1680     1666     1674     1679
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 6 (total: 1024000):
> (XEN) Node 0: 321064
> (XEN) Node 1: 702936
> (XEN) Domain has 8 vnodes, 8 vcpus
> (XEN) vnode 0 - pnode 1, 500 MB, vcpu nums: 0
> (XEN) vnode 1 - pnode 0, 500 MB, vcpu nums: 1
> (XEN) vnode 2 - pnode 1, 500 MB, vcpu nums: 2
> (XEN) vnode 3 - pnode 1, 500 MB, vcpu nums: 3
> (XEN) vnode 4 - pnode 0, 500 MB, vcpu nums: 4
> (XEN) vnode 5 - pnode 0, 1841 MB, vcpu nums: 5
> (XEN) vnode 6 - pnode 1, 500 MB, vcpu nums: 6
> (XEN) vnode 7 - pnode 1, 500 MB, vcpu nums: 7
>
> Current problems:
>
> Warning on CPU bringup on other node
>
> The cpus in the guest which belong to different NUMA nodes are configured
> to share the same L2 cache and are thus considered to be siblings.
> Siblings are expected to be on the same node, so one can see the following
> WARNING during boot:
>
> [ 0.022750] SMP alternatives: switching to SMP code
> [ 0.004000] ------------[ cut here ]------------
> [ 0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303 topology_sane.isra.8+0x67/0x79()
> [ 0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> [ 0.004000] Modules linked in:
> [ 0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
> [ 0.004000] 0000000000000000 0000000000000009 ffffffff813df458 ffff88007abe7e60
> [ 0.004000] ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08 ffffffff00000100
> [ 0.004000] 0000000000000001 ffff8800f6e13900 0000000000000000 000000000000b018
> [ 0.004000] Call Trace:
> [ 0.004000] [] ? dump_stack+0x41/0x51
> [ 0.004000] [] ? warn_slowpath_common+0x78/0x90
> [ 0.004000] [] ? topology_sane.isra.8+0x67/0x79
> [ 0.004000] [] ? warn_slowpath_fmt+0x45/0x4a
> [ 0.004000] [] ? topology_sane.isra.8+0x67/0x79
> [ 0.004000] [] ? set_cpu_sibling_map+0x1c9/0x3f7
> [ 0.004000] [] ? numa_add_cpu+0xa/0x18
> [ 0.004000] [] ? cpu_bringup+0x50/0x8f
> [ 0.004000] [] ? cpu_bringup_and_idle+0x1d/0x28
> [ 0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
> [ 0.035371] x86: Booted up 2 nodes, 2 CPUs
>
> The workaround is to specify cpuid in the config file and not use SMT.
> I will soon come up with a more acceptable solution.
>
> Incorrect amount of memory for nodes in debug-keys output
>
> Since the node ranges per domain are saved as guest addresses, the memory
> size calculated for some nodes is incorrect because it includes the guest
> e820 memory holes.
>
> TODO:
> - some modifications to automatic vnuma placement may be needed;
> - vdistance extended configuration parser will need to be in place;
> - SMT siblings problem (see above) will need a solution;
>
> Changes since v5:
> - reorganized patches;
> - modified domctl hypercall and added locking;
> - added XSM hypercalls with basic policies;
> - verified 32bit compatibility;
>
> Elena Ufimtseva (10):
>   xen: vnuma topology and subop hypercalls
>   xsm bits for vNUMA hypercalls
>   vnuma hook to debug-keys u
>   libxc: Introduce xc_domain_setvnuma to set vNUMA
>   libxl: vnuma topology configuration parser and doc
>   libxc: move code to arch_boot_alloc func
>   libxc: allocate domain memory for vnuma enabled domains
>   libxl: build numa nodes memory blocks
>   libxl: vnuma nodes placement bits
>   libxl: set vnuma for domain
>
>  docs/man/xl.cfg.pod.5               |  77 +++++++
>  tools/libxc/xc_dom.h                |  11 +
>  tools/libxc/xc_dom_x86.c            |  71 +++++-
>  tools/libxc/xc_domain.c             |  63 ++++++
>  tools/libxc/xenctrl.h               |   9 +
>  tools/libxc/xg_private.h            |   1 +
>  tools/libxl/libxl.c                 |  22 ++
>  tools/libxl/libxl.h                 |  19 ++
>  tools/libxl/libxl_create.c          |   1 +
>  tools/libxl/libxl_dom.c             | 148 ++++++++++++
>  tools/libxl/libxl_internal.h        |  12 +
>  tools/libxl/libxl_numa.c            | 193 ++++++++++++++++
>  tools/libxl/libxl_types.idl         |   6 +-
>  tools/libxl/libxl_vnuma.h           |   8 +
>  tools/libxl/libxl_x86.c             |   3 +-
>  tools/libxl/xl_cmdimpl.c            | 425 +++++++++++++++++++++++++++++++++++
>  xen/arch/x86/numa.c                 |  29 ++-
>  xen/common/domain.c                 |  13 ++
>  xen/common/domctl.c                 | 167 ++++++++++++++
>  xen/common/memory.c                 |  69 ++++++
>  xen/include/public/domctl.h         |  29 +++
>  xen/include/public/memory.h         |  47 +++-
>  xen/include/xen/domain.h            |  11 +
>  xen/include/xen/sched.h             |   1 +
>  xen/include/xsm/dummy.h             |   6 +
>  xen/include/xsm/xsm.h               |   7 +
>  xen/xsm/dummy.c                     |   1 +
>  xen/xsm/flask/hooks.c               |  10 +
>  xen/xsm/flask/policy/access_vectors |   4 +
>  29 files changed, 1447 insertions(+), 16 deletions(-)
>  create mode 100644 tools/libxl/libxl_vnuma.h
>
> --
> 1.7.10.4
>

--
Elena
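
A note for readers following the quoted examples: the two-element vdistance
settings ([10, 20] and [10, 40]) show up in the numactl output as a full
distance table with the first value on the diagonal and the second value
everywhere else. The Python sketch below only illustrates that expansion and
the single-node fallback described in the cover letter. It is not code from
the patches: the function names, the dictionary layout, and the default local
distance of 10 are assumptions made for illustration.

```python
def expand_vdistance(vdistance, nr_vnodes):
    """Expand a [local, remote] pair into an nr_vnodes x nr_vnodes table.

    Hypothetical helper: mirrors what the numactl output in the examples
    suggests, not the actual libxl parser.
    """
    if len(vdistance) != 2:
        raise ValueError("expected a [local, remote] pair")
    local, remote = vdistance
    return [
        [local if i == j else remote for j in range(nr_vnodes)]
        for i in range(nr_vnodes)
    ]


def default_topology(nr_vcpus, local_distance=10):
    """Single-vnode fallback: every vcpu is assigned to vnode 0.

    The local distance of 10 is an assumed default for illustration.
    """
    return {
        "vnodes": 1,
        "vdistance": [[local_distance]],
        "vcpu_to_vnode": [0] * nr_vcpus,
    }


if __name__ == "__main__":
    # Print the table for the 8-vnode example (vdistance = [10, 40]).
    for row in expand_vdistance([10, 40], 8):
        print(" ".join(str(d) for d in row))
```

Calling expand_vdistance([10, 20], 2) gives the 2x2 table from the first
example, and default_topology() models the case where no usable vNUMA
configuration is found.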