* [PATCH RESEND v7 0/9] vnuma introduction
@ 2014-08-21  5:08 Elena Ufimtseva
  2014-08-21  5:08 ` [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls Elena Ufimtseva
  2014-08-21  5:38 ` [PATCH RESEND v7 0/9] vnuma introduction Elena Ufimtseva
  0 siblings, 2 replies; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-21  5:08 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

vNUMA introduction

This series of patches introduces vNUMA topology awareness and
provides interfaces and data structures to enable vNUMA for
PV guests. There is a plan to extend this support to dom0 and
HVM domains.

vNUMA topology support must also be present in the PV guest kernel;
the corresponding Linux patches need to be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines. vNUMA-enabled guests may also run on non-NUMA
machines and still see a virtual NUMA topology. The Xen vNUMA implementation
provides a way to run vNUMA-enabled guests on both NUMA and UMA hardware and
to flexibly map the virtual NUMA topology onto the physical NUMA topology.

Mapping to the physical NUMA topology may be done manually or automatically.
By default, every PV domain has one vNUMA node; it is populated with default
parameters and does not affect performance. To initialize the vNUMA topology
automatically, the configuration file only needs to define the number of
vNUMA nodes; any vNUMA parameters left undefined are initialized to default
values.

vNUMA topology is currently defined as a set of parameters such as:
    number of vNUMA nodes;
    distance table;
    vnodes memory sizes;
    vcpus to vnodes mapping;
    vnode to pnode map (for NUMA machines).

This set of patches introduces two hypercall subops: XEN_DOMCTL_setvnumainfo
and XENMEM_get_vnumainfo.

    XEN_DOMCTL_setvnumainfo is used by the toolstack to populate the
domain vNUMA topology with the user-defined configuration or with default
parameters. vNUMA is defined for every PV domain: if no vNUMA configuration
is found, a single vNUMA node is initialized, all vcpus are assigned to it,
and all other parameters are set to their default values.

    XENMEM_get_vnumainfo is used by a PV domain to retrieve its vNUMA
topology from the hypervisor. The guest passes the sizes of the buffers it
has allocated for the various vNUMA parameters, and the hypervisor fills
them with the topology. Further work is required in the toolstack and the
hypervisor to allow HVM guests to use these hypercalls.

libxl

libxl allows the vNUMA topology to be defined in the guest configuration file
and verifies that the configuration is correct. libxl also verifies the
vnode-to-pnode mapping and uses it when running on a NUMA machine with
automatic placement disabled. In case of an incorrect or insufficient
configuration, a single vNUMA node is initialized and populated with default
values.
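
As an illustration of the kind of check involved, below is a minimal
sketch (the helper name is assumed; this is not the actual libxl code
from this series) that rejects a vnode-to-pnode map referring to a
physical node the host does not have:

#include <stdbool.h>
#include <libxl.h>

/* Sketch only: validate a user-supplied vnode-to-pnode map against the
 * number of physical NUMA nodes reported by the host. */
static bool vnodemap_is_valid(libxl_ctx *ctx,
                              const unsigned int *vnode_to_pnode,
                              unsigned int nr_vnodes)
{
    libxl_physinfo info;
    unsigned int i;
    bool ok = true;

    libxl_physinfo_init(&info);
    if (libxl_get_physinfo(ctx, &info) != 0) {
        libxl_physinfo_dispose(&info);
        return false;
    }

    for (i = 0; i < nr_vnodes; i++) {
        if (vnode_to_pnode[i] >= info.nr_nodes) {
            ok = false;   /* caller falls back to one default vnode */
            break;
        }
    }

    libxl_physinfo_dispose(&info);
    return ok;
}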

libxc

libxc builds the vnode memory ranges for the guest and applies the necessary
alignment to the addresses, taking the guest e820 memory map into account.
When the domain memory is allocated, the vnode-to-pnode mapping is used to
determine the target physical node for each vnode. If this mapping was not
defined, the host is not a NUMA machine, or automatic NUMA placement is
enabled, the default non node-specific allocation is used.
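
A minimal sketch of node-aware allocation along these lines, built on
the existing xc_domain_populate_physmap_exact()/XENMEMF_exact_node()
interfaces (the helper name and the shape of the per-vnode arrays are
assumptions, not the code from this series):

#include <xenctrl.h>

/* Sketch only: populate the domain's memory one vnode at a time, pinning
 * each chunk to the pnode given by vnode_to_pnode[] when a mapping exists.
 * pfns[] is the guest pfn list laid out per vnode by the domain builder. */
static int populate_vnodes_sketch(xc_interface *xch, uint32_t domid,
                                  unsigned int nr_vnodes,
                                  const unsigned long *vnode_pages,
                                  const unsigned int *vnode_to_pnode,
                                  xen_pfn_t *pfns)
{
    unsigned long done = 0;
    unsigned int i;
    int rc = 0;

    for ( i = 0; i < nr_vnodes && rc == 0; i++ )
    {
        /* No mapping (UMA host or automatic placement): default allocation. */
        unsigned int memflags = vnode_to_pnode ?
            XENMEMF_exact_node(vnode_to_pnode[i]) : 0;

        rc = xc_domain_populate_physmap_exact(xch, domid, vnode_pages[i],
                                              0 /* 4k pages */, memflags,
                                              &pfns[done]);
        done += vnode_pages[i];
    }

    return rc;
}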

hypervisor vNUMA initialization

PV guest

As of now, only a PV guest can take advantage of the vNUMA functionality.
Such a guest allocates memory for the NUMA topology structures and sets the
number of nodes and vcpus, so the hypervisor knows how much memory the guest
has preallocated for the vNUMA topology. The guest then issues the
XENMEM_get_vnumainfo subop hypercall.
If for some reason the vNUMA topology cannot be initialized, a Linux guest
falls back to a single NUMA node (standard Linux behavior).
To enable this, the vNUMA Linux patches must be applied to the PV guest
kernel.
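
A minimal sketch of the guest side of this exchange, assuming the
vnuma_topology_info definition from this series has been imported into
the guest kernel headers (as the Linux vNUMA patches below do); the
function name is hypothetical and error handling is trimmed:

#include <linux/slab.h>
#include <linux/numa.h>
#include <linux/cpumask.h>
#include <xen/interface/xen.h>
#include <xen/interface/memory.h>   /* vnuma_topology_info, with the patches */
#include <asm/xen/interface.h>
#include <asm/xen/hypercall.h>

/* Sketch only: preallocate buffers sized for the largest topology the
 * guest can accept, tell Xen how big they are, and let Xen fill them. */
static int xen_get_vnuma_topology_sketch(void)
{
        struct vnuma_topology_info topo = { .domid = DOMID_SELF };
        unsigned int nr_nodes = MAX_NUMNODES;
        unsigned int nr_cpus = num_possible_cpus();
        struct vmemrange *vmem;
        unsigned int *vdist, *cpu_to_node;
        int rc = -ENOMEM;

        vmem = kcalloc(nr_nodes, sizeof(*vmem), GFP_KERNEL);
        vdist = kcalloc(nr_nodes * nr_nodes, sizeof(*vdist), GFP_KERNEL);
        cpu_to_node = kcalloc(nr_cpus, sizeof(*cpu_to_node), GFP_KERNEL);
        if (!vmem || !vdist || !cpu_to_node)
                goto out;

        /* IN: sizes of the buffers the guest has preallocated. */
        topo.nr_vnodes = nr_nodes;
        topo.nr_vcpus = nr_cpus;
        set_xen_guest_handle(topo.vmemrange.h, vmem);
        set_xen_guest_handle(topo.vdistance.h, vdist);
        set_xen_guest_handle(topo.vcpu_to_vnode.h, cpu_to_node);

        rc = HYPERVISOR_memory_op(XENMEM_get_vnumainfo, &topo);
        /* On success, topo.nr_vnodes/nr_vcpus hold the actual counts and
         * the buffers can be fed into numa_add_memblk() and friends; on
         * failure the guest keeps the single-node default. */
out:
        kfree(vmem);
        kfree(vdist);
        kfree(cpu_to_node);
        return rc;
}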

Linux kernel patch is available here:
https://git.gitorious.org/vnuma/linux_vnuma.git
git://gitorious.org/vnuma/linux_vnuma.git

Automatic vNUMA placement

Automatic vNUMA placement is used if automatic NUMA placement is enabled
or, when it is disabled, if the vnode-to-pnode mapping is incorrect. If the
vnode-to-pnode mapping is correct and automatic NUMA placement is disabled,
the vNUMA nodes are allocated on the physical nodes specified in the guest
config file.
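
Restating that rule as code (assumed helper name, not actual code from
this series):

#include <stdbool.h>

/* Sketch: the user-supplied vnode-to-pnode map is honoured only on a NUMA
 * host, with automatic placement disabled and a map that passed validation;
 * in every other case automatic vNUMA placement / default allocation is
 * used instead. */
static bool use_user_vnodemap(bool host_is_numa,
                              bool numa_autoplacement,
                              bool vnodemap_valid)
{
    return host_is_numa && !numa_autoplacement && vnodemap_valid;
}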

Xen patchset is available here:
https://git.gitorious.org/vnuma/xen_vnuma.git
git://gitorious.org/vnuma/xen_vnuma.git


Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

memory = 4000
vcpus = 2
# The name of the domain, change this if you want more than 1 VM.
name = "null"
vnodes = 2
#vnumamem = [3000, 1000]
#vnumamem = [4000,0]
vdistance = [10, 20]
vnuma_vcpumap = [1, 0]
vnuma_vnodemap = [1]
vnuma_autoplacement = 0
#e820_host = 1

[    0.000000] Linux version 3.15.0-rc8+ (assert@superpipe) (gcc version 4.7.2 (Debian 4.7.2-5) ) #43 SMP Fri Jun 27 01:23:11 EDT 2014
[    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
[    0.000000] ACPI in unprivileged domain disabled
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0xf9e00000-0xf9ffffff]
[    0.000000]  [mem 0xf9e00000-0xf9ffffff] page 4k
[    0.000000] BRK [0x019c8000, 0x019c8fff] PGTABLE
[    0.000000] BRK [0x019c9000, 0x019c9fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0xf8000000-0xf9dfffff]
[    0.000000]  [mem 0xf8000000-0xf9dfffff] page 4k
[    0.000000] BRK [0x019ca000, 0x019cafff] PGTABLE
[    0.000000] BRK [0x019cb000, 0x019cbfff] PGTABLE
[    0.000000] BRK [0x019cc000, 0x019ccfff] PGTABLE
[    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x80000000-0xf7ffffff]
[    0.000000]  [mem 0x80000000-0xf7ffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0x7fffffff]
[    0.000000]  [mem 0x00100000-0x7fffffff] page 4k
[    0.000000] RAMDISK: [mem 0x01dd8000-0x035c5fff]
[    0.000000] Nodes received = 2
[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 1 [mem 0x7d000000-0xf9ffffff]
[    0.000000]   NODE_DATA [mem 0xf9828000-0xf984efff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x7cffffff]
[    0.000000]   node   1: [mem 0x7d000000-0xf9ffffff]
[    0.000000] On node 0 totalpages: 511903
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 7936 pages used for memmap
[    0.000000]   DMA32 zone: 507904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 512000
[    0.000000]   DMA32 zone: 8000 pages used for memmap
[    0.000000]   DMA32 zone: 512000 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.5-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:2 nr_node_ids:2
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007ac00000 s85888 r8192 d20608 u2097152
[    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1
[    0.000000] xen: PV spinlocks enabled
[    0.000000] Built 2 zonelists in Node order, mobility grouping on.  Total pages: 1007882
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
[    0.000000] Memory: 3978224K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 117388K reserved)
[    0.000000] Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.000000] installing Xen timer for CPU 0
[    0.000000] tsc: Detected 2394.276 MHz processor
[    0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4788.55 BogoMIPS (lpj=9577104)
[    0.004000] pid_max: default: 32768 minimum: 301
[    0.004179] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.006782] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.007216] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.007288] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.007935] CPU: Physical Processor ID: 0
[    0.007942] CPU: Processor Core ID: 0
[    0.007951] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
[    0.007951] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
[    0.007951] tlb_flushall_shift: 6
[    0.021249] cpu 0 spinlock event irq 17
[    0.021292] Performance Events: unsupported p6 CPU model 45 no PMU driver, software events only.
[    0.022162] NMI watchdog: disabled (cpu0): hardware events not enabled
[    0.022625] installing Xen timer for CPU 1

root@heatpipe:~# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 1933 MB
node 0 free: 1894 MB
node 1 cpus: 1
node 1 size: 1951 MB
node 1 free: 1926 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

root@heatpipe:~# numastat
                           node0           node1
numa_hit                   52257           92679
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit              4254            4238
local_node                 52150           87364
other_node                   107            5315

root@superpipe:~# xl debug-keys u

(XEN) Domain 7 (total: 1024000):
(XEN)     Node 0: 1024000
(XEN)     Node 1: 0
(XEN)     Domain has 2 vnodes, 2 vcpus
(XEN)         vnode 0 - pnode 0, 2000 MB, vcpu nums: 0
(XEN)         vnode 1 - pnode 0, 2000 MB, vcpu nums: 1


memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null1"
vnodes = 8
#vnumamem = [3000, 1000]
vdistance = [10, 40]
#vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 1, 0, 0, 1, 1]
vnuma_autoplacement = 1
e820_host = 1

[    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
[    0.000000] 1-1 mapping on ac228->100000
[    0.000000] Released 318936 pages of unused memory
[    0.000000] Set 343512 page(s) to 1-1 mapping
[    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
[    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
[    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
[    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
[    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
[    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
[    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b5fff] unusable
[    0.000000] Xen: [mem 0x00000000ac6b6000-0x00000000ac7fafff] ACPI NVS
[    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
[    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
[    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
[    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
[    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
[    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
[    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
[    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
[    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
[    0.000000] BRK [0x019ce000, 0x019cefff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
[    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
[    0.000000] BRK [0x019cf000, 0x019cffff] PGTABLE
[    0.000000] BRK [0x019d0000, 0x019d0fff] PGTABLE
[    0.000000] BRK [0x019d1000, 0x019d1fff] PGTABLE
[    0.000000] BRK [0x019d2000, 0x019d2fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
[    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
[    0.000000]  [mem 0x00100000-0xac227fff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dd8000-0x0347ffff]
[    0.000000] Nodes received = 8
[    0.000000] NUMA: Initialized distance table, cnt=8
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x1f3fffff]
[    0.000000]   NODE_DATA [mem 0x1f3d9000-0x1f3fffff]
[    0.000000] Initmem setup node 1 [mem 0x1f800000-0x3e7fffff]
[    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
[    0.000000] Initmem setup node 2 [mem 0x3e800000-0x5dbfffff]
[    0.000000]   NODE_DATA [mem 0x5dbd9000-0x5dbfffff]
[    0.000000] Initmem setup node 3 [mem 0x5e000000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 4 [mem 0x7d000000-0x9c3fffff]
[    0.000000]   NODE_DATA [mem 0x9c3d9000-0x9c3fffff]
[    0.000000] Initmem setup node 5 [mem 0x9c800000-0x10f5d7fff]
[    0.000000]   NODE_DATA [mem 0x10f5b1000-0x10f5d7fff]
[    0.000000] Initmem setup node 6 [mem 0x10f800000-0x12e9d7fff]
[    0.000000]   NODE_DATA [mem 0x12e9b1000-0x12e9d7fff]
[    0.000000] Initmem setup node 7 [mem 0x12f000000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x1f3fffff]
[    0.000000]   node   1: [mem 0x1f400000-0x3e7fffff]
[    0.000000]   node   2: [mem 0x3e800000-0x5dbfffff]
[    0.000000]   node   3: [mem 0x5dc00000-0x7cffffff]
[    0.000000]   node   4: [mem 0x7d000000-0x9c3fffff]
[    0.000000]   node   5: [mem 0x9c400000-0xac227fff]
[    0.000000]   node   5: [mem 0x100000000-0x10f5d7fff]
[    0.000000]   node   6: [mem 0x10f5d8000-0x12e9d7fff]
[    0.000000]   node   7: [mem 0x12e9d8000-0x14ddd7fff]
[    0.000000] On node 0 totalpages: 127903
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 1936 pages used for memmap
[    0.000000]   DMA32 zone: 123904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 4 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 5 totalpages: 128000
[    0.000000]   DMA32 zone: 1017 pages used for memmap
[    0.000000]   DMA32 zone: 65064 pages, LIFO batch:15
[    0.000000]   Normal zone: 984 pages used for memmap
[    0.000000]   Normal zone: 62936 pages, LIFO batch:15
[    0.000000] On node 6 totalpages: 128000
[    0.000000]   Normal zone: 2000 pages used for memmap
[    0.000000]   Normal zone: 128000 pages, LIFO batch:31
[    0.000000] On node 7 totalpages: 128000
[    0.000000]   Normal zone: 2000 pages used for memmap
[    0.000000]   Normal zone: 128000 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] PM: Registered nosave memory: [mem 0xac228000-0xac26bfff]
[    0.000000] PM: Registered nosave memory: [mem 0xac26c000-0xac57ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xac580000-0xac5a0fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5a1000-0xac5bbfff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5bc000-0xac5bdfff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5be000-0xac5befff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5bf000-0xac5cafff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5cb000-0xac5d9fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5da000-0xac5fafff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5fb000-0xac6b5fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac6b6000-0xac7fafff]
[    0.000000] PM: Registered nosave memory: [mem 0xac7fb000-0xac80efff]
[    0.000000] PM: Registered nosave memory: [mem 0xac80f000-0xac80ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xac810000-0xac810fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac811000-0xac812fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac813000-0xad7fffff]
[    0.000000] PM: Registered nosave memory: [mem 0xad800000-0xafffffff]
[    0.000000] PM: Registered nosave memory: [mem 0xb0000000-0xb3ffffff]
[    0.000000] PM: Registered nosave memory: [mem 0xb4000000-0xfed1ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed20000-0xfed3ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed40000-0xfed4ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed50000-0xfed8ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed90000-0xfedfffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfee00000-0xfeefffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfef00000-0xff9fffff]
[    0.000000] PM: Registered nosave memory: [mem 0xffa00000-0xffa3ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xffa40000-0xffffffff]
[    0.000000] e820: [mem 0xb4000000-0xfed1ffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.5-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:8 nr_node_ids:8
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88001e800000 s85888 r8192 d20608 u2097152
[    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 [4] 4 [5] 5 [6] 6 [7] 7
[    0.000000] xen: PV spinlocks enabled
[    0.000000] Built 8 zonelists in Node order, mobility grouping on.  Total pages: 1007881
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: root=/dev/xvda1 ro console=hvc0 debug  kgdboc=hvc0 nokgdbroundup  initcall_debug debug
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Memory: 3976748K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 118864K reserved)

root@heatpipe:~# numactl --ha
maxn: 7
available: 8 nodes (0-7)
node 0 cpus: 0
node 0 size: 458 MB
node 0 free: 424 MB
node 1 cpus: 1
node 1 size: 491 MB
node 1 free: 481 MB
node 2 cpus: 2
node 2 size: 491 MB
node 2 free: 482 MB
node 3 cpus: 3
node 3 size: 491 MB
node 3 free: 485 MB
node 4 cpus: 4
node 4 size: 491 MB
node 4 free: 485 MB
node 5 cpus: 5
node 5 size: 491 MB
node 5 free: 484 MB
node 6 cpus: 6
node 6 size: 491 MB
node 6 free: 486 MB
node 7 cpus: 7
node 7 size: 476 MB
node 7 free: 471 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  40  40  40  40  40  40  40
  1:  40  10  40  40  40  40  40  40
  2:  40  40  10  40  40  40  40  40
  3:  40  40  40  10  40  40  40  40
  4:  40  40  40  40  10  40  40  40
  5:  40  40  40  40  40  10  40  40
  6:  40  40  40  40  40  40  10  40
  7:  40  40  40  40  40  40  40  10

root@heatpipe:~# numastat
                           node0           node1           node2           node3
numa_hit                  182203           14574           23800           17017
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              1016            1010            1051            1030
local_node                180995           12906           23272           15338
other_node                  1208            1668             528            1679

                           node4           node5           node6           node7
numa_hit                   10621           15346            3529            3863
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              1026            1017            1031            1029
local_node                  8941           13680            1855            2184
other_node                  1680            1666            1674            1679

root@superpipe:~# xl debug-keys u

(XEN) Domain 6 (total: 1024000):
(XEN)     Node 0: 321064
(XEN)     Node 1: 702936
(XEN)     Domain has 8 vnodes, 8 vcpus
(XEN)         vnode 0 - pnode 1, 500 MB, vcpu nums: 0
(XEN)         vnode 1 - pnode 0, 500 MB, vcpu nums: 1
(XEN)         vnode 2 - pnode 1, 500 MB, vcpu nums: 2
(XEN)         vnode 3 - pnode 1, 500 MB, vcpu nums: 3
(XEN)         vnode 4 - pnode 0, 500 MB, vcpu nums: 4
(XEN)         vnode 5 - pnode 0, 1841 MB, vcpu nums: 5
(XEN)         vnode 6 - pnode 1, 500 MB, vcpu nums: 6
(XEN)         vnode 7 - pnode 1, 500 MB, vcpu nums: 7

Current problems:

Warning on CPU bringup on another node

    (This was reported as a separate problem but is kept here for reference.)
    Guest cpus that belong to different vNUMA nodes are presented as sharing
    the same L2 cache and are therefore treated as SMT siblings, yet they are
    not on the same node. The following WARNING can be seen during boot:

[    0.022750] SMP alternatives: switching to SMP code
[    0.004000] ------------[ cut here ]------------
[    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303 topology_sane.isra.8+0x67/0x79()
[    0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.004000] Modules linked in:
[    0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
[    0.004000]  0000000000000000 0000000000000009 ffffffff813df458 ffff88007abe7e60
[    0.004000]  ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08 ffffffff00000100
[    0.004000]  0000000000000001 ffff8800f6e13900 0000000000000000 000000000000b018
[    0.004000] Call Trace:
[    0.004000]  [<ffffffff813df458>] ? dump_stack+0x41/0x51
[    0.004000]  [<ffffffff81048963>] ? warn_slowpath_common+0x78/0x90
[    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
[    0.004000]  [<ffffffff81048a13>] ? warn_slowpath_fmt+0x45/0x4a
[    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
[    0.004000]  [<ffffffff8102fd2e>] ? set_cpu_sibling_map+0x1c9/0x3f7
[    0.004000]  [<ffffffff81042146>] ? numa_add_cpu+0xa/0x18
[    0.004000]  [<ffffffff8100b4e2>] ? cpu_bringup+0x50/0x8f
[    0.004000]  [<ffffffff8100b544>] ? cpu_bringup_and_idle+0x1d/0x28
[    0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
[    0.035371] x86: Booted up 2 nodes, 2 CPUs

The workaround is to specify cpuid in the config file so that SMT is not exposed to
the guest. I will come up with a more acceptable solution soon.

Incorrect amount of memory for the nodes in the debug-keys output

    Since the per-domain node ranges are stored as guest addresses, the memory
    calculated for some nodes is incorrect because of holes in the guest e820
    memory map.

TODO:
    - some modifications to automatic vnuma placement may be needed;
    - an extended vdistance configuration parser needs to be added;
    - the SMT siblings problem (see above) needs a solution (different series);

Changes since v6:
    - added a limit on the number of vNUMA nodes per domain (32) on the Xen
    side. This will be increased in the next version as the limit does not
    seem to be big enough;
    - added a read/write lock to the domain structure to synchronize access to
    the vnuma structure;
    - added copying of the actual number of vcpus back to the guest;
    - added example xsm policies;
    - reorganized the series so that the xl implementation comes after the
    libxl definitions;
    - changed the idl names for the vnuma structure members in libxc;
    - changed the failure path in Xen when setting the vnuma topology: instead
    of creating a default node, the operation now fails, so the toolstack and
    Xen do not end up with different views of the vnuma topology;
    - changed the failure path when parsing the vnuma config to just fail
    instead of creating a single default node;

Changes since v5:
    - reorganized patches;
    - modified the domctl hypercall and added locking;
    - added XSM checks for the new hypercalls with basic policies;
    - verified 32-bit compatibility;

Elena Ufimtseva (9):
  xen: vnuma topology and subop hypercalls
  xsm bits for vNUMA hypercalls
  vnuma hook to debug-keys u
  libxc: Introduce xc_domain_setvnuma to set vNUMA
  libxl: vnuma types declararion
  libxl: build numa nodes memory blocks
  libxc: allocate domain memory for vnuma enabled
  libxl: vnuma nodes placement bits
  libxl: vnuma topology configuration parser and doc

 docs/man/xl.cfg.pod.5                        |   77 +++++
 tools/flask/policy/policy/modules/xen/xen.if |    3 +-
 tools/flask/policy/policy/modules/xen/xen.te |    2 +-
 tools/libxc/xc_dom.h                         |   13 +
 tools/libxc/xc_dom_x86.c                     |   76 ++++-
 tools/libxc/xc_domain.c                      |   63 ++++
 tools/libxc/xenctrl.h                        |    9 +
 tools/libxl/libxl_create.c                   |    1 +
 tools/libxl/libxl_dom.c                      |  148 +++++++++
 tools/libxl/libxl_internal.h                 |    9 +
 tools/libxl/libxl_numa.c                     |  193 ++++++++++++
 tools/libxl/libxl_types.idl                  |    7 +-
 tools/libxl/libxl_vnuma.h                    |   13 +
 tools/libxl/libxl_x86.c                      |    3 +-
 tools/libxl/xl_cmdimpl.c                     |  425 ++++++++++++++++++++++++++
 xen/arch/x86/numa.c                          |   30 +-
 xen/common/domain.c                          |   15 +
 xen/common/domctl.c                          |  122 ++++++++
 xen/common/memory.c                          |  106 +++++++
 xen/include/public/arch-x86/xen.h            |    9 +
 xen/include/public/domctl.h                  |   29 ++
 xen/include/public/memory.h                  |   47 ++-
 xen/include/xen/domain.h                     |   11 +
 xen/include/xen/sched.h                      |    4 +
 xen/include/xsm/dummy.h                      |    6 +
 xen/include/xsm/xsm.h                        |    7 +
 xen/xsm/dummy.c                              |    1 +
 xen/xsm/flask/hooks.c                        |   10 +
 xen/xsm/flask/policy/access_vectors          |    4 +
 29 files changed, 1425 insertions(+), 18 deletions(-)
 create mode 100644 tools/libxl/libxl_vnuma.h

-- 
1.7.10.4


* [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls
  2014-08-21  5:08 [PATCH RESEND v7 0/9] vnuma introduction Elena Ufimtseva
@ 2014-08-21  5:08 ` Elena Ufimtseva
  2014-08-22 13:17   ` Jan Beulich
  2014-08-21  5:38 ` [PATCH RESEND v7 0/9] vnuma introduction Elena Ufimtseva
  1 sibling, 1 reply; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-21  5:08 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Define the interface, structures and hypercalls for the toolstack to
build the vnuma topology and for guests that wish to retrieve it.
Two subop hypercalls are introduced by this patch:
XEN_DOMCTL_setvnumainfo to set a domain's vNUMA topology, and
XENMEM_get_vnumainfo for the guest to retrieve that topology.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/common/domain.c               |   15 +++++
 xen/common/domctl.c               |  122 +++++++++++++++++++++++++++++++++++++
 xen/common/memory.c               |   99 ++++++++++++++++++++++++++++++
 xen/include/public/arch-x86/xen.h |    9 +++
 xen/include/public/domctl.h       |   29 +++++++++
 xen/include/public/memory.h       |   47 +++++++++++++-
 xen/include/xen/domain.h          |   11 ++++
 xen/include/xen/sched.h           |    4 ++
 8 files changed, 335 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 1952070..94d977c 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -280,6 +280,8 @@ struct domain *domain_create(
 
     spin_lock_init(&d->pbuf_lock);
 
+    rwlock_init(&d->vnuma_rwlock);
+
     err = -ENOMEM;
     if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) )
         goto fail;
@@ -588,6 +590,18 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
     return 0;
 }
 
+void vnuma_destroy(struct vnuma_info *vnuma)
+{
+    if ( vnuma )
+    {
+        xfree(vnuma->vmemrange);
+        xfree(vnuma->vcpu_to_vnode);
+        xfree(vnuma->vdistance);
+        xfree(vnuma->vnode_to_pnode);
+        xfree(vnuma);
+    }
+}
+
 int domain_kill(struct domain *d)
 {
     int rc = 0;
@@ -606,6 +620,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem_client);
+        vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         d->tmem_client = NULL;
         /* fallthrough */
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index c326aba..356a3cf 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -297,6 +297,99 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
             guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
 }
 
+/*
+ * Allocates memory for vNUMA, **vnuma should be NULL.
+ * Caller has to make sure that domain has max_pages
+ * and number of vcpus set for domain.
+ * Verifies that single allocation does not exceed
+ * PAGE_SIZE.
+ */
+static int vnuma_alloc(struct vnuma_info **vnuma,
+                       unsigned int nr_vnodes,
+                       unsigned int nr_vcpus)
+{
+    if ( vnuma && *vnuma )
+        return -EINVAL;
+
+    if ( nr_vnodes > XEN_MAX_VNODES )
+        return -EINVAL;
+
+    /*
+     * If XEN_MAX_VNODES increases, these allocations
+     * should be split into PAGE_SIZE allocations
+     * due to XCA-77.
+     */
+    *vnuma = xzalloc(struct vnuma_info);
+    if ( !*vnuma )
+        return -ENOMEM;
+
+    (*vnuma)->vdistance = xmalloc_array(unsigned int, nr_vnodes * nr_vnodes);
+    (*vnuma)->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
+    (*vnuma)->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
+    (*vnuma)->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
+
+    if ( (*vnuma)->vdistance == NULL || (*vnuma)->vmemrange == NULL ||
+         (*vnuma)->vcpu_to_vnode == NULL || (*vnuma)->vnode_to_pnode == NULL )
+    {
+        vnuma_destroy(*vnuma);
+        return -ENOMEM;
+    }
+
+    return 0;
+}
+
+/*
+ * Construct vNUMA topology form u_vnuma struct and return
+ * it in dst.
+ */
+long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
+                const struct domain *d,
+                struct vnuma_info **dst)
+{
+    unsigned int nr_vnodes;
+    long ret = -EINVAL;
+    struct vnuma_info *v = NULL;
+
+    /* If vNUMA topology already set, just exit. */
+    if ( *dst )
+        return ret;
+
+    nr_vnodes = u_vnuma->nr_vnodes;
+
+    if ( nr_vnodes == 0 )
+        return ret;
+
+    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus);
+    if ( ret )
+        return ret;
+
+    ret = -EFAULT;
+
+    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance,
+                         nr_vnodes * nr_vnodes) )
+        goto vnuma_fail;
+
+    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
+        goto vnuma_fail;
+
+    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
+                         d->max_vcpus) )
+        goto vnuma_fail;
+
+    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
+                         nr_vnodes) )
+        goto vnuma_fail;
+
+    v->nr_vnodes = nr_vnodes;
+    *dst = v;
+
+    return 0;
+
+ vnuma_fail:
+    vnuma_destroy(v);
+    return ret;
+}
+
 long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 {
     long ret = 0;
@@ -967,6 +1060,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
     }
     break;
 
+    case XEN_DOMCTL_setvnumainfo:
+    {
+        struct vnuma_info *v = NULL;
+
+        ret = -EINVAL;
+
+        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
+            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
+            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
+            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) ) {
+            break;
+        }
+
+        ret = vnuma_init(&op->u.vnuma, d, &v);
+        if ( ret < 0 )
+            break;
+
+        ASSERT(v != NULL);
+
+        /* overwrite vnuma for domain */
+        write_lock(&d->vnuma_rwlock);
+        vnuma_destroy(d->vnuma);
+        d->vnuma = v;
+        write_unlock(&d->vnuma_rwlock);
+
+        ret = 0;
+    }
+    break;
+
     default:
         ret = arch_do_domctl(op, d, u_domctl);
         break;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index c2dd31b..337914b 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -969,6 +969,105 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         break;
 
+    case XENMEM_get_vnumainfo:
+    {
+        struct vnuma_topology_info topology;
+        struct domain *d;
+        unsigned int dom_vnodes, dom_vcpus;
+        struct vnuma_info vnuma_tmp;
+
+        /*
+         * guest passes nr_vnodes and nr_vcpus thus
+         * we know how much memory guest has allocated.
+         */
+        if ( copy_from_guest(&topology, arg, 1) ||
+            guest_handle_is_null(topology.vmemrange.h) ||
+            guest_handle_is_null(topology.vdistance.h) ||
+            guest_handle_is_null(topology.vcpu_to_vnode.h) )
+            return -EFAULT;
+
+        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
+            return -ESRCH;
+
+        rc = -EOPNOTSUPP;
+
+        read_lock(&d->vnuma_rwlock);
+
+        if ( d->vnuma == NULL )
+        {
+            read_unlock(&d->vnuma_rwlock);
+            rcu_unlock_domain(d);
+            return rc;
+        }
+
+        dom_vnodes = d->vnuma->nr_vnodes;
+        dom_vcpus = d->max_vcpus;
+
+        read_unlock(&d->vnuma_rwlock);
+
+        vnuma_tmp.vdistance = xmalloc_array(unsigned int,
+                                            dom_vnodes * dom_vnodes);
+        vnuma_tmp.vmemrange = xmalloc_array(vmemrange_t, dom_vnodes);
+        vnuma_tmp.vcpu_to_vnode = xmalloc_array(unsigned int, dom_vcpus);
+
+        if ( vnuma_tmp.vdistance == NULL || vnuma_tmp.vmemrange == NULL ||
+             vnuma_tmp.vcpu_to_vnode == NULL )
+        {
+            rc = -ENOMEM;
+            goto vnumainfo_out;
+        }
+
+        read_lock(&d->vnuma_rwlock);
+
+        memcpy(vnuma_tmp.vmemrange, d->vnuma->vmemrange,
+               sizeof(*d->vnuma->vmemrange) * dom_vnodes);
+        memcpy(vnuma_tmp.vdistance, d->vnuma->vdistance,
+               sizeof(*d->vnuma->vdistance) * dom_vnodes * dom_vnodes);
+        memcpy(vnuma_tmp.vcpu_to_vnode, d->vnuma->vcpu_to_vnode,
+               sizeof(*d->vnuma->vcpu_to_vnode) * dom_vcpus);
+
+        read_unlock(&d->vnuma_rwlock);
+
+        /*
+         * guest nr_cpus and nr_nodes may differ from domain vnuma config.
+         * Check here guest nr_nodes and nr_cpus to make sure we dont overflow.
+         */
+        rc = -ENOBUFS;
+        if ( topology.nr_vnodes < dom_vnodes ||
+             topology.nr_vcpus < dom_vcpus )
+            goto vnumainfo_out;
+
+        rc = -EFAULT;
+
+        if ( copy_to_guest(topology.vmemrange.h, vnuma_tmp.vmemrange,
+                           dom_vnodes) != 0 )
+            goto vnumainfo_out;
+
+        if ( copy_to_guest(topology.vdistance.h, vnuma_tmp.vdistance,
+                           dom_vnodes * dom_vnodes) != 0 )
+            goto vnumainfo_out;
+
+        if ( copy_to_guest(topology.vcpu_to_vnode.h, vnuma_tmp.vcpu_to_vnode,
+                           dom_vcpus) != 0 )
+            goto vnumainfo_out;
+
+        topology.nr_vnodes = dom_vnodes;
+        topology.nr_vcpus = dom_vcpus;
+
+        if ( __copy_to_guest(arg, &topology, 1) != 0 )
+            goto vnumainfo_out;
+
+        rc = 0;
+
+ vnumainfo_out:
+        rcu_unlock_domain(d);
+
+        xfree(vnuma_tmp.vdistance);
+        xfree(vnuma_tmp.vmemrange);
+        xfree(vnuma_tmp.vcpu_to_vnode);
+        break;
+    }
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index f35804b..a4c3d58 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -108,6 +108,15 @@ typedef unsigned long xen_pfn_t;
 /* Maximum number of virtual CPUs in legacy multi-processor guests. */
 #define XEN_LEGACY_MAX_VCPUS 32
 
+/*
+ * Maximum number of virtual NUMA nodes per domain.
+ * This restriction is related to a security advice
+ * XSA-77 and max xmalloc size of PAGE_SIZE. This limit
+ * avoids multi page allocation for vnuma. This limit
+ * will be increased in next version.
+ */
+#define XEN_MAX_VNODES 32
+
 #ifndef __ASSEMBLY__
 
 typedef unsigned long xen_ulong_t;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 5b11bbf..5ee74f4 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -35,6 +35,7 @@
 #include "xen.h"
 #include "grant_table.h"
 #include "hvm/save.h"
+#include "memory.h"
 
 #define XEN_DOMCTL_INTERFACE_VERSION 0x0000000a
 
@@ -934,6 +935,32 @@ struct xen_domctl_vcpu_msrs {
 };
 typedef struct xen_domctl_vcpu_msrs xen_domctl_vcpu_msrs_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpu_msrs_t);
+
+/*
+ * Use in XEN_DOMCTL_setvnumainfo to set
+ * vNUMA domain topology.
+ */
+struct xen_domctl_vnuma {
+    uint32_t nr_vnodes;
+    uint32_t _pad;
+    XEN_GUEST_HANDLE_64(uint) vdistance;
+    XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode;
+
+    /*
+     * vnodes to physical NUMA nodes mask.
+     * This kept on per-domain basis for
+     * interested consumers, such as numa aware ballooning.
+     */
+    XEN_GUEST_HANDLE_64(uint) vnode_to_pnode;
+
+    /*
+     * memory rages for each vNUMA node
+     */
+    XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange;
+};
+typedef struct xen_domctl_vnuma xen_domctl_vnuma_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t);
+
 #endif
 
 struct xen_domctl {
@@ -1008,6 +1035,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_cacheflush                    71
 #define XEN_DOMCTL_get_vcpu_msrs                 72
 #define XEN_DOMCTL_set_vcpu_msrs                 73
+#define XEN_DOMCTL_setvnumainfo                  74
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1068,6 +1096,7 @@ struct xen_domctl {
         struct xen_domctl_cacheflush        cacheflush;
         struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
+        struct xen_domctl_vnuma             vnuma;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 2c57aa0..2c212e1 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -521,9 +521,54 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
  * The zero value is appropiate.
  */
 
+/* vNUMA node memory range */
+struct vmemrange {
+    uint64_t start, end;
+};
+
+typedef struct vmemrange vmemrange_t;
+DEFINE_XEN_GUEST_HANDLE(vmemrange_t);
+
+/*
+ * vNUMA topology specifies vNUMA node number, distance table,
+ * memory ranges and vcpu mapping provided for guests.
+ * XENMEM_get_vnumainfo hypercall expects to see from guest
+ * nr_vnodes and nr_vcpus to indicate available memory. After
+ * filling guests structures, nr_vnodes and nr_vcpus copied
+ * back to guest.
+ */
+struct vnuma_topology_info {
+    /* IN */
+    domid_t domid;
+    /* IN/OUT */
+    unsigned int nr_vnodes;
+    unsigned int nr_vcpus;
+    /* OUT */
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t pad;
+    } vdistance;
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t pad;
+    } vcpu_to_vnode;
+    union {
+        XEN_GUEST_HANDLE(vmemrange_t) h;
+        uint64_t pad;
+    } vmemrange;
+};
+typedef struct vnuma_topology_info vnuma_topology_info_t;
+DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
+
+/*
+ * XENMEM_get_vnumainfo used by guest to get
+ * vNUMA topology from hypervisor.
+ */
+#define XENMEM_get_vnumainfo               26
+
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
-/* Next available subop number is 26 */
+/* Next available subop number is 27 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index bb1c398..d29a84d 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -89,4 +89,15 @@ extern unsigned int xen_processor_pmbits;
 
 extern bool_t opt_dom0_vcpus_pin;
 
+/* vnuma topology per domain. */
+struct vnuma_info {
+    unsigned int nr_vnodes;
+    unsigned int *vdistance;
+    unsigned int *vcpu_to_vnode;
+    unsigned int *vnode_to_pnode;
+    struct vmemrange *vmemrange;
+};
+
+void vnuma_destroy(struct vnuma_info *vnuma);
+
 #endif /* __XEN_DOMAIN_H__ */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 4575dda..c5157e6 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -452,6 +452,10 @@ struct domain
     nodemask_t node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
+
+    /* vNUMA topology accesses are protected by rwlock. */
+    rwlock_t vnuma_rwlock;
+    struct vnuma_info *vnuma;
 };
 
 struct domain_setup_info
-- 
1.7.10.4


* Re: [PATCH RESEND v7 0/9] vnuma introduction
  2014-08-21  5:08 [PATCH RESEND v7 0/9] vnuma introduction Elena Ufimtseva
  2014-08-21  5:08 ` [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls Elena Ufimtseva
@ 2014-08-21  5:38 ` Elena Ufimtseva
  2014-08-21 14:00   ` Dario Faggioli
  1 sibling, 1 reply; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-21  5:38 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, Jan Beulich,
	Daniel De Graaf, Elena Ufimtseva

On Thu, Aug 21, 2014 at 1:08 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> vNUMA introduction
>
> This series of patches introduces vNUMA topology awareness and
> provides interfaces and data structures to enable vNUMA for
> PV guests. There is a plan to extend this support for dom0 and
> HVM domains.
>
> vNUMA topology support should be supported by PV guest kernel.
> Corresponding patches should be applied.
>
> Introduction
> -------------
>
> vNUMA topology is exposed to the PV guest to improve performance when running
> workloads on NUMA machines. vNUMA enabled guests may be running on non-NUMA
> machines and thus having virtual NUMA topology visible to guests.
> XEN vNUMA implementation provides a way to run vNUMA-enabled guests on NUMA/UMA
> and flexibly map vNUMA topology to physical NUMA topology.
>
> Mapping to physical NUMA topology may be done in manual and automatic way.
> By default, every PV domain has one vNUMA node. It is populated by default
> parameters and does not affect performance. To use automatic way of initializing
> vNUMA topology, configuration file need only to have number of vNUMA nodes
> defined. Not-defined vNUMA topology parameters will be initialized to default
> ones.
>
> vNUMA topology is currently defined as a set of parameters such as:
>     number of vNUMA nodes;
>     distance table;
>     vnodes memory sizes;
>     vcpus to vnodes mapping;
>     vnode to pnode map (for NUMA machines).
>
> This set of patches introduces two hypercall subops: XEN_DOMCTL_setvnumainfo
> and XENMEM_get_vnuma_info.
>
>     XEN_DOMCTL_setvnumainfo is used by toolstack to populate domain
> vNUMA topology with user defined configuration or the parameters by default.
> vNUMA is defined for every PV domain and if no vNUMA configuration found,
> one vNUMA node is initialized and all cpus are assigned to it. All other
> parameters set to their default values.
>
>     XENMEM_gevnumainfo is used by the PV domain to get the information
> from hypervisor about vNUMA topology. Guest sends its memory sizes allocated
> for different vNUMA parameters and hypervisor fills it with topology.
> Future work to use this in HVM guests in the toolstack is required and
> in the hypervisor to allow HVM guests to use these hypercalls.
>
> libxl
>
> libxl allows us to define vNUMA topology in configuration file and verifies that
> configuration is correct. libxl also verifies mapping of vnodes to pnodes and
> uses it in case of NUMA-machine and if automatic placement was disabled. In case
> of incorrect/insufficient configuration, one vNUMA node will be initialized
> and populated with default values.
>
> libxc
>
> libxc builds the vnodes memory addresses for guest and makes necessary
> alignments to the addresses. It also takes into account guest e820 memory map
> configuration. The domain memory is allocated and vnode to pnode mapping
> is used to determine target node for particular vnode. If this mapping was not
> defined, it is not a NUMA machine or automatic NUMA placement is enabled, the
> default not node-specific allocation will be used.
>
> hypervisor vNUMA initialization
>
> PV guest
>
> As of now, only PV guest can take advantage of vNUMA functionality.
> Such guest allocates the memory for NUMA topology, sets number of nodes and
> cpus so hypervisor has information about how much memory guest has
> preallocated for vNUMA topology. Further guest makes subop hypercall
> XENMEM_getvnumainfo.
> If for some reason vNUMA topology cannot be initialized, Linux guest
> will have only one NUMA node initialized (standard Linux behavior).
> To enable this, vNUMA Linux patches should be applied and vNUMA supporting
> patches should be applied to PV kernel.
>
> Linux kernel patch is available here:
> https://git.gitorious.org/vnuma/linux_vnuma.git
> git://gitorious.org/vnuma/linux_vnuma.git
>
> Automatic vNUMA placement
>
> vNUMA automatic placement will be enabled if numa automatic placement is
> not in enabled or, if disabled, if vnode to pnode mapping is incorrect. If
> vnode to pnode mapping is correct and automatic NUMA placement disabled,
> vNUMA nodes will be allocated on nodes as it was specified in the guest
> config file.
>
> Xen patchset is available here:
> https://git.gitorious.org/vnuma/xen_vnuma.git
> git://gitorious.org/vnuma/xen_vnuma.git
>
>
> Examples of booting vNUMA enabled PV Linux guest on real NUMA machine:
>
> memory = 4000
> vcpus = 2
> # The name of the domain, change this if you want more than 1 VM.
> name = "null"
> vnodes = 2
> #vnumamem = [3000, 1000]
> #vnumamem = [4000,0]
> vdistance = [10, 20]
> vnuma_vcpumap = [1, 0]
> vnuma_vnodemap = [1]
> vnuma_autoplacement = 0
> #e820_host = 1
>
> [    0.000000] Linux version 3.15.0-rc8+ (assert@superpipe) (gcc version 4.7.2 (Debian 4.7.2-5) ) #43 SMP Fri Jun 27 01:23:11 EDT 2014
> [    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
> [    0.000000] ACPI in unprivileged domain disabled
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
> [    0.000000] bootconsole [xenboot0] enabled
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] No AGP bridge found
> [    0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000
> [    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0xf9e00000-0xf9ffffff]
> [    0.000000]  [mem 0xf9e00000-0xf9ffffff] page 4k
> [    0.000000] BRK [0x019c8000, 0x019c8fff] PGTABLE
> [    0.000000] BRK [0x019c9000, 0x019c9fff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0xf8000000-0xf9dfffff]
> [    0.000000]  [mem 0xf8000000-0xf9dfffff] page 4k
> [    0.000000] BRK [0x019ca000, 0x019cafff] PGTABLE
> [    0.000000] BRK [0x019cb000, 0x019cbfff] PGTABLE
> [    0.000000] BRK [0x019cc000, 0x019ccfff] PGTABLE
> [    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x80000000-0xf7ffffff]
> [    0.000000]  [mem 0x80000000-0xf7ffffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x00100000-0x7fffffff]
> [    0.000000]  [mem 0x00100000-0x7fffffff] page 4k
> [    0.000000] RAMDISK: [mem 0x01dd8000-0x035c5fff]
> [    0.000000] Nodes received = 2
> [    0.000000] NUMA: Initialized distance table, cnt=2
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0x7cffffff]
> [    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
> [    0.000000] Initmem setup node 1 [mem 0x7d000000-0xf9ffffff]
> [    0.000000]   NODE_DATA [mem 0xf9828000-0xf984efff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
> [    0.000000]   node   0: [mem 0x00100000-0x7cffffff]
> [    0.000000]   node   1: [mem 0x7d000000-0xf9ffffff]
> [    0.000000] On node 0 totalpages: 511903
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 7936 pages used for memmap
> [    0.000000]   DMA32 zone: 507904 pages, LIFO batch:31
> [    0.000000] On node 1 totalpages: 512000
> [    0.000000]   DMA32 zone: 8000 pages used for memmap
> [    0.000000]   DMA32 zone: 512000 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
> [    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [    0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.5-unstable (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:2 nr_node_ids:2
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007ac00000 s85888 r8192 d20608 u2097152
> [    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 [1] 1
> [    0.000000] xen: PV spinlocks enabled
> [    0.000000] Built 2 zonelists in Node order, mobility grouping on.  Total pages: 1007882
> [    0.000000] Policy zone: DMA32
> [    0.000000] Kernel command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
> [    0.000000] Memory: 3978224K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 117388K reserved)
> [    0.000000] Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> [    0.000000] installing Xen timer for CPU 0
> [    0.000000] tsc: Detected 2394.276 MHz processor
> [    0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4788.55 BogoMIPS (lpj=9577104)
> [    0.004000] pid_max: default: 32768 minimum: 301
> [    0.004179] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
> [    0.006782] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [    0.007216] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.007288] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.007935] CPU: Physical Processor ID: 0
> [    0.007942] CPU: Processor Core ID: 0
> [    0.007951] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
> [    0.007951] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
> [    0.007951] tlb_flushall_shift: 6
> [    0.021249] cpu 0 spinlock event irq 17
> [    0.021292] Performance Events: unsupported p6 CPU model 45 no PMU driver, software events only.
> [    0.022162] NMI watchdog: disabled (cpu0): hardware events not enabled
> [    0.022625] installing Xen timer for CPU 1
>
> root@heatpipe:~# numactl --ha
> available: 2 nodes (0-1)
> node 0 cpus: 0
> node 0 size: 1933 MB
> node 0 free: 1894 MB
> node 1 cpus: 1
> node 1 size: 1951 MB
> node 1 free: 1926 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> root@heatpipe:~# numastat
>                            node0           node1
> numa_hit                   52257           92679
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit              4254            4238
> local_node                 52150           87364
> other_node                   107            5315
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 7 (total: 1024000):
> (XEN)     Node 0: 1024000
> (XEN)     Node 1: 0
> (XEN)     Domain has 2 vnodes, 2 vcpus
> (XEN)         vnode 0 - pnode 0, 2000 MB, vcpu nums: 0
> (XEN)         vnode 1 - pnode 0, 2000 MB, vcpu nums: 1
>
>
> memory = 4000
> vcpus = 8
> # The name of the domain, change this if you want more than 1 VM.
> name = "null1"
> vnodes = 8
> #vnumamem = [3000, 1000]
> vdistance = [10, 40]
> #vnuma_vcpumap = [1, 0, 3, 2]
> vnuma_vnodemap = [1, 0, 1, 1, 0, 0, 1, 1]
> vnuma_autoplacement = 1
> e820_host = 1
>
> [    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
> [    0.000000] 1-1 mapping on ac228->100000
> [    0.000000] Released 318936 pages of unused memory
> [    0.000000] Set 343512 page(s) to 1-1 mapping
> [    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
> [    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
> [    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
> [    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b5fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac6b6000-0x00000000ac7fafff] ACPI NVS
> [    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
> [    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
> [    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
> [    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
> [    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
> [    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
> [    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
> [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
> [    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
> [    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] No AGP bridge found
> [    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
> [    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
> [    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
> [    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
> [    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
> [    0.000000] BRK [0x019ce000, 0x019cefff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
> [    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
> [    0.000000] BRK [0x019cf000, 0x019cffff] PGTABLE
> [    0.000000] BRK [0x019d0000, 0x019d0fff] PGTABLE
> [    0.000000] BRK [0x019d1000, 0x019d1fff] PGTABLE
> [    0.000000] BRK [0x019d2000, 0x019d2fff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
> [    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
> [    0.000000]  [mem 0x00100000-0xac227fff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
> [    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
> [    0.000000] RAMDISK: [mem 0x01dd8000-0x0347ffff]
> [    0.000000] Nodes received = 8
> [    0.000000] NUMA: Initialized distance table, cnt=8
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0x1f3fffff]
> [    0.000000]   NODE_DATA [mem 0x1f3d9000-0x1f3fffff]
> [    0.000000] Initmem setup node 1 [mem 0x1f800000-0x3e7fffff]
> [    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
> [    0.000000] Initmem setup node 2 [mem 0x3e800000-0x5dbfffff]
> [    0.000000]   NODE_DATA [mem 0x5dbd9000-0x5dbfffff]
> [    0.000000] Initmem setup node 3 [mem 0x5e000000-0x7cffffff]
> [    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
> [    0.000000] Initmem setup node 4 [mem 0x7d000000-0x9c3fffff]
> [    0.000000]   NODE_DATA [mem 0x9c3d9000-0x9c3fffff]
> [    0.000000] Initmem setup node 5 [mem 0x9c800000-0x10f5d7fff]
> [    0.000000]   NODE_DATA [mem 0x10f5b1000-0x10f5d7fff]
> [    0.000000] Initmem setup node 6 [mem 0x10f800000-0x12e9d7fff]
> [    0.000000]   NODE_DATA [mem 0x12e9b1000-0x12e9d7fff]
> [    0.000000] Initmem setup node 7 [mem 0x12f000000-0x14ddd7fff]
> [    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> [    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
> [    0.000000]   node   0: [mem 0x00100000-0x1f3fffff]
> [    0.000000]   node   1: [mem 0x1f400000-0x3e7fffff]
> [    0.000000]   node   2: [mem 0x3e800000-0x5dbfffff]
> [    0.000000]   node   3: [mem 0x5dc00000-0x7cffffff]
> [    0.000000]   node   4: [mem 0x7d000000-0x9c3fffff]
> [    0.000000]   node   5: [mem 0x9c400000-0xac227fff]
> [    0.000000]   node   5: [mem 0x100000000-0x10f5d7fff]
> [    0.000000]   node   6: [mem 0x10f5d8000-0x12e9d7fff]
> [    0.000000]   node   7: [mem 0x12e9d8000-0x14ddd7fff]
> [    0.000000] On node 0 totalpages: 127903
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 1936 pages used for memmap
> [    0.000000]   DMA32 zone: 123904 pages, LIFO batch:31
> [    0.000000] On node 1 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 2 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 3 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 4 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 5 totalpages: 128000
> [    0.000000]   DMA32 zone: 1017 pages used for memmap
> [    0.000000]   DMA32 zone: 65064 pages, LIFO batch:15
> [    0.000000]   Normal zone: 984 pages used for memmap
> [    0.000000]   Normal zone: 62936 pages, LIFO batch:15
> [    0.000000] On node 6 totalpages: 128000
> [    0.000000]   Normal zone: 2000 pages used for memmap
> [    0.000000]   Normal zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 7 totalpages: 128000
> [    0.000000]   Normal zone: 2000 pages used for memmap
> [    0.000000]   Normal zone: 128000 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
> [    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac228000-0xac26bfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac26c000-0xac57ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac580000-0xac5a0fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5a1000-0xac5bbfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5bc000-0xac5bdfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5be000-0xac5befff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5bf000-0xac5cafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5cb000-0xac5d9fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5da000-0xac5fafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5fb000-0xac6b5fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac6b6000-0xac7fafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac7fb000-0xac80efff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac80f000-0xac80ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac810000-0xac810fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac811000-0xac812fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac813000-0xad7fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xad800000-0xafffffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xb0000000-0xb3ffffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xb4000000-0xfed1ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed20000-0xfed3ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed40000-0xfed4ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed50000-0xfed8ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed90000-0xfedfffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfee00000-0xfeefffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfef00000-0xff9fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xffa00000-0xffa3ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xffa40000-0xffffffff]
> [    0.000000] e820: [mem 0xb4000000-0xfed1ffff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.5-unstable (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:8 nr_node_ids:8
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88001e800000 s85888 r8192 d20608 u2097152
> [    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 [4] 4 [5] 5 [6] 6 [7] 7
> [    0.000000] xen: PV spinlocks enabled
> [    0.000000] Built 8 zonelists in Node order, mobility grouping on.  Total pages: 1007881
> [    0.000000] Policy zone: Normal
> [    0.000000] Kernel command line: root=/dev/xvda1 ro console=hvc0 debug  kgdboc=hvc0 nokgdbroundup  initcall_debug debug
> [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340
> [    0.000000] Checking aperture...
> [    0.000000] No AGP bridge found
> [    0.000000] Memory: 3976748K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 118864K reserved)
>
> root@heatpipe:~# numactl --ha
> maxn: 7
> available: 8 nodes (0-7)
> node 0 cpus: 0
> node 0 size: 458 MB
> node 0 free: 424 MB
> node 1 cpus: 1
> node 1 size: 491 MB
> node 1 free: 481 MB
> node 2 cpus: 2
> node 2 size: 491 MB
> node 2 free: 482 MB
> node 3 cpus: 3
> node 3 size: 491 MB
> node 3 free: 485 MB
> node 4 cpus: 4
> node 4 size: 491 MB
> node 4 free: 485 MB
> node 5 cpus: 5
> node 5 size: 491 MB
> node 5 free: 484 MB
> node 6 cpus: 6
> node 6 size: 491 MB
> node 6 free: 486 MB
> node 7 cpus: 7
> node 7 size: 476 MB
> node 7 free: 471 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  40  40  40  40  40  40  40
>   1:  40  10  40  40  40  40  40  40
>   2:  40  40  10  40  40  40  40  40
>   3:  40  40  40  10  40  40  40  40
>   4:  40  40  40  40  10  40  40  40
>   5:  40  40  40  40  40  10  40  40
>   6:  40  40  40  40  40  40  10  40
>   7:  40  40  40  40  40  40  40  10
>
> root@heatpipe:~# numastat
>                            node0           node1           node2           node3
> numa_hit                  182203           14574           23800           17017
> numa_miss                      0               0               0               0
> numa_foreign                   0               0               0               0
> interleave_hit              1016            1010            1051            1030
> local_node                180995           12906           23272           15338
> other_node                  1208            1668             528            1679
>
>                            node4           node5           node6           node7
> numa_hit                   10621           15346            3529            3863
> numa_miss                      0               0               0               0
> numa_foreign                   0               0               0               0
> interleave_hit              1026            1017            1031            1029
> local_node                  8941           13680            1855            2184
> other_node                  1680            1666            1674            1679
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 6 (total: 1024000):
> (XEN)     Node 0: 321064
> (XEN)     Node 1: 702936
> (XEN)     Domain has 8 vnodes, 8 vcpus
> (XEN)         vnode 0 - pnode 1, 500 MB, vcpu nums: 0
> (XEN)         vnode 1 - pnode 0, 500 MB, vcpu nums: 1
> (XEN)         vnode 2 - pnode 1, 500 MB, vcpu nums: 2
> (XEN)         vnode 3 - pnode 1, 500 MB, vcpu nums: 3
> (XEN)         vnode 4 - pnode 0, 500 MB, vcpu nums: 4
> (XEN)         vnode 5 - pnode 0, 1841 MB, vcpu nums: 5
> (XEN)         vnode 6 - pnode 1, 500 MB, vcpu nums: 6
> (XEN)         vnode 7 - pnode 1, 500 MB, vcpu nums: 7
>
> Current problems:
>
> This was marked as a separate problem, but I am leaving it here for reference.
> Warning on CPU bringup on another node
>
>     The CPUs in the guest which belong to different NUMA nodes are configured
>     to share the same L2 cache and are thus considered to be siblings, even
>     though they are not on the same node. One can see the following WARNING
>     during boot:
>
> [    0.022750] SMP alternatives: switching to SMP code
> [    0.004000] ------------[ cut here ]------------
> [    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303 topology_sane.isra.8+0x67/0x79()
> [    0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> [    0.004000] Modules linked in:
> [    0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
> [    0.004000]  0000000000000000 0000000000000009 ffffffff813df458 ffff88007abe7e60
> [    0.004000]  ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08 ffffffff00000100
> [    0.004000]  0000000000000001 ffff8800f6e13900 0000000000000000 000000000000b018
> [    0.004000] Call Trace:
> [    0.004000]  [<ffffffff813df458>] ? dump_stack+0x41/0x51
> [    0.004000]  [<ffffffff81048963>] ? warn_slowpath_common+0x78/0x90
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff81048a13>] ? warn_slowpath_fmt+0x45/0x4a
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff8102fd2e>] ? set_cpu_sibling_map+0x1c9/0x3f7
> [    0.004000]  [<ffffffff81042146>] ? numa_add_cpu+0xa/0x18
> [    0.004000]  [<ffffffff8100b4e2>] ? cpu_bringup+0x50/0x8f
> [    0.004000]  [<ffffffff8100b544>] ? cpu_bringup_and_idle+0x1d/0x28
> [    0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
> [    0.035371] x86: Booted up 2 nodes, 2 CPUs
>
> The workaround is to specify cpuid in the config file and not use SMT. I will
> soon come up with some other acceptable solution.
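>
> A possible sketch of that workaround, assuming the libxl cpuid syntax
> documented in xl.cfg(5) (the exact flag name should be double-checked there):
>
> # hide hyper-threading from the guest so that vcpus placed on different
> # vnodes are not reported as SMT siblings
> cpuid = [ "host,htt=0" ]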
>
> Incorrect amount of memory for nodes in debug-keys output
>
>     Since the node ranges per domain are saved in guest addresses, the memory
>     calculated is incorrect due to the guest e820 memory holes for some nodes.
>
> TODO:
>     - some modifications to automatic vnuma placement may be needed;
>     - vdistance extended configuration parser will need to be in place;
>     - SMT siblings problem (see above) will need a solution (different series);
>
> Changes since v6:
>     - added a limit on the number of vNUMA nodes per domain (32) on the Xen
>     side. This will be increased in the next version, as this limit does not
>     seem to be big enough;
>     - added a read/write lock to the domain structure to synchronize access
>     to the vnuma structure;
>     - added copying the actual number of vcpus back to the guest;
>     - added xsm example policies;
>     - reorganized the series so that the xl implementation goes after the
>     libxl definitions;
>     - changed the idl names for vnuma structure members in libxc;
>     - changed the failure path in Xen when setting the vnuma topology: instead
>     of creating a default node, fail outright so as not to introduce different
>     views of vnuma between the toolstack and Xen;
>     - changed the failure path when parsing the vnuma config to just fail
>     instead of creating a single default node;
>
> Changes since v5:
>     - reorganized patches;
>     - modified domctl hypercall and added locking;
>     - added XSM hypercalls with basic policies;
>     - verify 32bit compatibility;
>
> Elena Ufimtseva (9):
>   xen: vnuma topology and subop hypercalls
>   xsm bits for vNUMA hypercalls
>   vnuma hook to debug-keys u
>   libxc: Introduce xc_domain_setvnuma to set vNUMA
>   libxl: vnuma types declararion
>   libxl: build numa nodes memory blocks
>   libxc: allocate domain memory for vnuma enabled
>   libxl: vnuma nodes placement bits
>   libxl: vnuma topology configuration parser and doc
>
>  docs/man/xl.cfg.pod.5                        |   77 +++++
>  tools/flask/policy/policy/modules/xen/xen.if |    3 +-
>  tools/flask/policy/policy/modules/xen/xen.te |    2 +-
>  tools/libxc/xc_dom.h                         |   13 +
>  tools/libxc/xc_dom_x86.c                     |   76 ++++-
>  tools/libxc/xc_domain.c                      |   63 ++++
>  tools/libxc/xenctrl.h                        |    9 +
>  tools/libxl/libxl_create.c                   |    1 +
>  tools/libxl/libxl_dom.c                      |  148 +++++++++
>  tools/libxl/libxl_internal.h                 |    9 +
>  tools/libxl/libxl_numa.c                     |  193 ++++++++++++
>  tools/libxl/libxl_types.idl                  |    7 +-
>  tools/libxl/libxl_vnuma.h                    |   13 +
>  tools/libxl/libxl_x86.c                      |    3 +-
>  tools/libxl/xl_cmdimpl.c                     |  425 ++++++++++++++++++++++++++
>  xen/arch/x86/numa.c                          |   30 +-
>  xen/common/domain.c                          |   15 +
>  xen/common/domctl.c                          |  122 ++++++++
>  xen/common/memory.c                          |  106 +++++++
>  xen/include/public/arch-x86/xen.h            |    9 +
>  xen/include/public/domctl.h                  |   29 ++
>  xen/include/public/memory.h                  |   47 ++-
>  xen/include/xen/domain.h                     |   11 +
>  xen/include/xen/sched.h                      |    4 +
>  xen/include/xsm/dummy.h                      |    6 +
>  xen/include/xsm/xsm.h                        |    7 +
>  xen/xsm/dummy.c                              |    1 +
>  xen/xsm/flask/hooks.c                        |   10 +
>  xen/xsm/flask/policy/access_vectors          |    4 +
>  29 files changed, 1425 insertions(+), 18 deletions(-)
>  create mode 100644 tools/libxl/libxl_vnuma.h
>
> --
> 1.7.10.4
>

Hello

I am re-sending this series as previous send of v7 was not yet
reviewed except by Daniel on xsm part.
It has some changes mentioned in the change log of patch 0/9.

Please review this series and send your comments.

-- 
Elena

* Re: [PATCH RESEND v7 0/9] vnuma introduction
  2014-08-21  5:38 ` [PATCH RESEND v7 0/9] vnuma introduction Elena Ufimtseva
@ 2014-08-21 14:00   ` Dario Faggioli
  2014-08-21 14:11     ` Elena Ufimtseva
  0 siblings, 1 reply; 10+ messages in thread
From: Dario Faggioli @ 2014-08-21 14:00 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf


On gio, 2014-08-21 at 01:38 -0400, Elena Ufimtseva wrote:
> Hello
> 
Hey!

> I am re-sending this series as previous send of v7 was not yet
> reviewed except by Daniel on xsm part.
> It has some changes mentioned in the change log of patch 0/9.
> 
A ping would probably have been enough, but ok... :-P

> Please review this series and send your comments.
> 
Right. I'll let you have my comments on this ASAP.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH RESEND v7 0/9] vnuma introduction
  2014-08-21 14:00   ` Dario Faggioli
@ 2014-08-21 14:11     ` Elena Ufimtseva
  2014-08-21 14:15       ` Elena Ufimtseva
  0 siblings, 1 reply; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-21 14:11 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf

On Thu, Aug 21, 2014 at 10:00 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On gio, 2014-08-21 at 01:38 -0400, Elena Ufimtseva wrote:
>> Hello
>>
> Hey!


Hi Dario!

>
>> I am re-sending this series as previous send of v7 was not yet
>> reviewed except by Daniel on xsm part.
>> It has some changes mentioned in the change log of patch 0/9.
>>
> A ping would probably have been enough, but ok... :-P

I pinged everyone at the conference, except you :)
Actually, there are some changes in this series.
>
>> Please review this series and send your comments.
>>
> Right. I'll let you have my comments on this ASAP.
Yes, great!
Thank you )
>
> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Elena

* Re: [PATCH RESEND v7 0/9] vnuma introduction
  2014-08-21 14:11     ` Elena Ufimtseva
@ 2014-08-21 14:15       ` Elena Ufimtseva
  0 siblings, 0 replies; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-21 14:15 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf

On Thu, Aug 21, 2014 at 10:11 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> On Thu, Aug 21, 2014 at 10:00 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> On gio, 2014-08-21 at 01:38 -0400, Elena Ufimtseva wrote:
>>> Hello
>>>
>> Hey!
>
>
> Hi Dario!
>
>>
>>> I am re-sending this series as previous send of v7 was not yet
>>> reviewed except by Daniel on xsm part.
>>> It has some changes mentioned in the change log of patch 0/9.
>>>
>> A ping would probably have been enough, but ok... :-P
>
> I pinged everyone at the conference, except you :)
> Actually, there are some changes in this series.
>>
>>> Please review this series and send your comments.
>>>
>> Right. I'll let you have my comments on this ASAP.
> Yes, great!
> Thank you )

If you find that a re-send is not the best thing to do here, I can
re-send the series as a new version.
>>
>> Regards,
>> Dario
>>
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>
>
>
>
> --
> Elena



-- 
Elena

* Re: [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls
  2014-08-21  5:08 ` [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls Elena Ufimtseva
@ 2014-08-22 13:17   ` Jan Beulich
  2014-08-22 13:54     ` Dario Faggioli
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2014-08-22 13:17 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	dario.faggioli, stefano.stabellini, ian.jackson, xen-devel

>>> On 21.08.14 at 07:08, <ufimtseva@gmail.com> wrote:
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -297,6 +297,99 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
>              guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
>  }
>  
> +/*
> + * Allocates memory for vNUMA, **vnuma should be NULL.
> + * Caller has to make sure that domain has max_pages
> + * and number of vcpus set for domain.
> + * Verifies that single allocation does not exceed
> + * PAGE_SIZE.
> + */
> +static int vnuma_alloc(struct vnuma_info **vnuma,
> +                       unsigned int nr_vnodes,
> +                       unsigned int nr_vcpus)
> +{
> +    if ( vnuma && *vnuma )

I can see the point of the right side of the &&, but the left side doesn't
seem to make sense in a static function - the callers should get it right,
or it's a bug, not an error.
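
I.e. the check could reduce to something like this (a sketch only):

    ASSERT(vnuma);          /* a NULL double pointer is a caller bug */
    if ( *vnuma )           /* topology already set: a real error */
        return -EINVAL;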

> +        return -EINVAL;
> +
> +    if ( nr_vnodes > XEN_MAX_VNODES )
> +        return -EINVAL;
> +
> +    /*
> +     * If XEN_MAX_VNODES increases, these allocations
> +     * should be split into PAGE_SIZE allocations
> +     * due to XCA-77.
> +     */
> +    *vnuma = xzalloc(struct vnuma_info);

This could be xmalloc(), since it doesn't get installed into struct
domain anyway before ->nr_vnodes got set properly (and all
the allocations below don't zero their memory either).

> +    if ( !*vnuma )
> +        return -ENOMEM;
> +
> +    (*vnuma)->vdistance = xmalloc_array(unsigned int, nr_vnodes * nr_vnodes);
> +    (*vnuma)->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
> +    (*vnuma)->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
> +    (*vnuma)->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
> +
> +    if ( (*vnuma)->vdistance == NULL || (*vnuma)->vmemrange == NULL ||
> +         (*vnuma)->vcpu_to_vnode == NULL || (*vnuma)->vnode_to_pnode == NULL )
> +    {
> +        vnuma_destroy(*vnuma);
> +        return -ENOMEM;
> +    }
> +
> +    return 0;
> +}
> +

I think vnuma_destroy() would better go here than in a different
source file.

> +/*
> + * Construct vNUMA topology form u_vnuma struct and return
> + * it in dst.
> + */
> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
> +                const struct domain *d,
> +                struct vnuma_info **dst)
> +{
> +    unsigned int nr_vnodes;
> +    long ret = -EINVAL;

Any reason for this and the function return type being "long" rather
than "int"?

> +    struct vnuma_info *v = NULL;
> +
> +    /* If vNUMA topology already set, just exit. */
> +    if ( *dst )
> +        return ret;
> +
> +    nr_vnodes = u_vnuma->nr_vnodes;
> +
> +    if ( nr_vnodes == 0 )
> +        return ret;
> +
> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus);

If you made use of the definitions in xen/err.h you could avoid the
indirection on the first argument (dropping it altogether).
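
For illustration, a minimal sketch of that alternative (assuming the usual
ERR_PTR()/IS_ERR()/PTR_ERR() helpers from xen/err.h; not the code this
series actually carries):

    static struct vnuma_info *vnuma_alloc(unsigned int nr_vnodes,
                                          unsigned int nr_vcpus)
    {
        struct vnuma_info *vnuma;

        if ( nr_vnodes > XEN_MAX_VNODES )
            return ERR_PTR(-EINVAL);

        vnuma = xzalloc(struct vnuma_info);
        if ( !vnuma )
            return ERR_PTR(-ENOMEM);

        /* ... array allocations as in the patch ... */

        return vnuma;
    }

    /* caller side */
    v = vnuma_alloc(nr_vnodes, d->max_vcpus);
    if ( IS_ERR(v) )
        return PTR_ERR(v);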

> +    if ( ret )
> +        return ret;
> +
> +    ret = -EFAULT;
> +
> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance,
> +                         nr_vnodes * nr_vnodes) )
> +        goto vnuma_fail;
> +
> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
> +        goto vnuma_fail;

Isn't a single memory range per vnode rather limiting? Physical
machines frequently have at least one node with two ranges to
accommodate the hole below 4Gb. And since the interface is for
all guest kinds I'm afraid this will harm setting up guests rather
sooner than later. (I'm sorry for thinking of this only now.)
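
Purely as an illustration (a hypothetical layout, nothing this series
defines), a multi-range variant could decouple the range array from the
node count and key each range by a node id:

    struct xen_vmemrange {            /* hypothetical name */
        uint64_t start, end;
        uint32_t nid;                 /* vnode this range belongs to */
        uint32_t pad;
    };

    /* the caller would then pass nr_vmemranges >= nr_vnodes, and a vnode
     * could own several, non-adjacent entries in the array */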

> +
> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
> +                         d->max_vcpus) )
> +        goto vnuma_fail;
> +
> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
> +                         nr_vnodes) )
> +        goto vnuma_fail;
> +
> +    v->nr_vnodes = nr_vnodes;
> +    *dst = v;
> +
> +    return 0;
> +
> + vnuma_fail:
> +    vnuma_destroy(v);
> +    return ret;
> +}
> +
>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  {
>      long ret = 0;
> @@ -967,6 +1060,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) 
> u_domctl)
>      }
>      break;
>  
> +    case XEN_DOMCTL_setvnumainfo:
> +    {
> +        struct vnuma_info *v = NULL;

Bad variable name (normally stands for vCPU).

> +
> +        ret = -EINVAL;
> +
> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) ) {

Indentation. Also - is there a precedent for doing such checks
elsewhere? Unless NIL handles have a special meaning, we normally
just implicitly produce -EFAULT when accessing the usually unmapped
memory region at address zero.

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -969,6 +969,105 @@ long do_memory_op(unsigned long cmd, 
> XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>          break;
>  
> +    case XENMEM_get_vnumainfo:
> +    {
> +        struct vnuma_topology_info topology;
> +        struct domain *d;

There is a suitable variable already at function scope.

> +        unsigned int dom_vnodes, dom_vcpus;
> +        struct vnuma_info vnuma_tmp;

Calling this just "tmp" would be quite okay here.

> +
> +        /*
> +         * guest passes nr_vnodes and nr_vcpus thus
> +         * we know how much memory guest has allocated.
> +         */
> +        if ( copy_from_guest(&topology, arg, 1) ||
> +            guest_handle_is_null(topology.vmemrange.h) ||
> +            guest_handle_is_null(topology.vdistance.h) ||
> +            guest_handle_is_null(topology.vcpu_to_vnode.h) )

Same two remarks here.

> +            return -EFAULT;
> +
> +        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
> +            return -ESRCH;
> +
> +        rc = -EOPNOTSUPP;

This is pointless - the set value is being used only in one error path
below where you could as well use the literal.

> +
> +        read_lock(&d->vnuma_rwlock);
> +
> +        if ( d->vnuma == NULL )
> +        {
> +            read_unlock(&d->vnuma_rwlock);
> +            rcu_unlock_domain(d);
> +            return rc;
> +        }
> +
> +        dom_vnodes = d->vnuma->nr_vnodes;
> +        dom_vcpus = d->max_vcpus;
> +
> +        read_unlock(&d->vnuma_rwlock);
> +
> +        vnuma_tmp.vdistance = xmalloc_array(unsigned int,
> +                                            dom_vnodes * dom_vnodes);
> +        vnuma_tmp.vmemrange = xmalloc_array(vmemrange_t, dom_vnodes);
> +        vnuma_tmp.vcpu_to_vnode = xmalloc_array(unsigned int, dom_vcpus);
> +
> +        if ( vnuma_tmp.vdistance == NULL || vnuma_tmp.vmemrange == NULL ||
> +             vnuma_tmp.vcpu_to_vnode == NULL )
> +        {
> +            rc = -ENOMEM;
> +            goto vnumainfo_out;
> +        }

Please use a consistent style when the body of the if might be just a
single control transfer: further down you set "rc" before the if,
here you do it inside its body.

> +
> +        read_lock(&d->vnuma_rwlock);

Even if generally better avoided I think you'd be better off not
dropping the lock, as that way you may return inconsistent data
to your caller. Or alternatively check that the two counts didn't
change by now, starting over if they did.
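
A sketch of the re-check alternative, for illustration only (reusing the
variable names from the patch; the "again" label is hypothetical):

        read_lock(&d->vnuma_rwlock);

        /* the topology may have been replaced while the buffers were
         * being allocated: if the sizes changed, free them and start over */
        if ( d->vnuma == NULL || d->vnuma->nr_vnodes != dom_vnodes ||
             d->max_vcpus != dom_vcpus )
        {
            read_unlock(&d->vnuma_rwlock);
            xfree(vnuma_tmp.vdistance);
            xfree(vnuma_tmp.vmemrange);
            xfree(vnuma_tmp.vcpu_to_vnode);
            goto again;
        }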

> +
> +        memcpy(vnuma_tmp.vmemrange, d->vnuma->vmemrange,
> +               sizeof(*d->vnuma->vmemrange) * dom_vnodes);
> +        memcpy(vnuma_tmp.vdistance, d->vnuma->vdistance,
> +               sizeof(*d->vnuma->vdistance) * dom_vnodes * dom_vnodes);
> +        memcpy(vnuma_tmp.vcpu_to_vnode, d->vnuma->vcpu_to_vnode,
> +               sizeof(*d->vnuma->vcpu_to_vnode) * dom_vcpus);
> +
> +        read_unlock(&d->vnuma_rwlock);
> +
> +        /*
> +         * guest nr_cpus and nr_nodes may differ from domain vnuma config.
> +         * Check here guest nr_nodes and nr_cpus to make sure we dont overflow.
> +         */
> +        rc = -ENOBUFS;
> +        if ( topology.nr_vnodes < dom_vnodes ||
> +             topology.nr_vcpus < dom_vcpus )
> +            goto vnumainfo_out;

Wouldn't this better be done earlier (before allocating any of the
intermediate buffers)? And shouldn't you tell the caller the needed
values?

> +
> +        rc = -EFAULT;
> +
> +        if ( copy_to_guest(topology.vmemrange.h, vnuma_tmp.vmemrange,
> +                           dom_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( copy_to_guest(topology.vdistance.h, vnuma_tmp.vdistance,
> +                           dom_vnodes * dom_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( copy_to_guest(topology.vcpu_to_vnode.h, vnuma_tmp.vcpu_to_vnode,
> +                           dom_vcpus) != 0 )
> +            goto vnumainfo_out;
> +
> +        topology.nr_vnodes = dom_vnodes;
> +        topology.nr_vcpus = dom_vcpus;
> +
> +        if ( __copy_to_guest(arg, &topology, 1) != 0 )
> +            goto vnumainfo_out;
> +
> +        rc = 0;
> +
> + vnumainfo_out:
> +        rcu_unlock_domain(d);
> +
> +        xfree(vnuma_tmp.vdistance);
> +        xfree(vnuma_tmp.vmemrange);
> +        xfree(vnuma_tmp.vcpu_to_vnode);
> +        break;
> +    }
> +
>      default:
>          rc = arch_memory_op(cmd, arg);
>          break;
> diff --git a/xen/include/public/arch-x86/xen.h 
> b/xen/include/public/arch-x86/xen.h
> index f35804b..a4c3d58 100644
> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -108,6 +108,15 @@ typedef unsigned long xen_pfn_t;
>  /* Maximum number of virtual CPUs in legacy multi-processor guests. */
>  #define XEN_LEGACY_MAX_VCPUS 32
>  
> +/*
> + * Maximum number of virtual NUMA nodes per domain.
> + * This restriction is related to a security advice
> + * XSA-77 and max xmalloc size of PAGE_SIZE. This limit
> + * avoids multi page allocation for vnuma. This limit
> + * will be increased in next version.
> + */
> +#define XEN_MAX_VNODES 32

This is pointless (and perhaps in the wrong header) - it's purely an
implementation detail that there currently is such a limit. You don't
even need the #define in an internal header afaict, since the only
place it's being used could easily do a suitable check against PAGE_SIZE.
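
For illustration, such a check could sit directly in vnuma_alloc(), sized
against the allocations the patch performs (a sketch only):

    /* keep every single allocation within one page (cf. XSA-77) */
    if ( nr_vnodes * nr_vnodes * sizeof(unsigned int) > PAGE_SIZE ||
         nr_vnodes * sizeof(vmemrange_t) > PAGE_SIZE ||
         nr_vcpus * sizeof(unsigned int) > PAGE_SIZE )
        return -EINVAL;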

> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -521,9 +521,54 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>   * The zero value is appropiate.
>   */
>  
> +/* vNUMA node memory range */
> +struct vmemrange {
> +    uint64_t start, end;
> +};
> +
> +typedef struct vmemrange vmemrange_t;
> +DEFINE_XEN_GUEST_HANDLE(vmemrange_t);
> +
> +/*
> + * vNUMA topology specifies vNUMA node number, distance table,
> + * memory ranges and vcpu mapping provided for guests.
> + * XENMEM_get_vnumainfo hypercall expects to see from guest
> + * nr_vnodes and nr_vcpus to indicate available memory. After
> + * filling guests structures, nr_vnodes and nr_vcpus copied
> + * back to guest.
> + */
> +struct vnuma_topology_info {
> +    /* IN */
> +    domid_t domid;
> +    /* IN/OUT */
> +    unsigned int nr_vnodes;

Please make the padding between above two fields explicit.
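
E.g. (field name purely illustrative):

    /* IN */
    domid_t domid;
    uint16_t pad;               /* explicit padding, must be zero */
    /* IN/OUT */
    unsigned int nr_vnodes;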

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -452,6 +452,10 @@ struct domain
>      nodemask_t node_affinity;
>      unsigned int last_alloc_node;
>      spinlock_t node_affinity_lock;
> +
> +    /* vNUMA topology accesses are protected by rwlock. */
> +    rwlock_t vnuma_rwlock;

I think there's no need for the "rw" in the field name.

Jan

* Re: [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls
  2014-08-22 13:17   ` Jan Beulich
@ 2014-08-22 13:54     ` Dario Faggioli
  2014-08-22 17:38       ` Elena Ufimtseva
  0 siblings, 1 reply; 10+ messages in thread
From: Dario Faggioli @ 2014-08-22 13:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	stefano.stabellini, ian.jackson, xen-devel, Elena Ufimtseva


On ven, 2014-08-22 at 14:17 +0100, Jan Beulich wrote:

> > +    if ( ret )
> > +        return ret;
> > +
> > +    ret = -EFAULT;
> > +
> > +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance,
> > +                         nr_vnodes * nr_vnodes) )
> > +        goto vnuma_fail;
> > +
> > +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
> > +        goto vnuma_fail;
> 
> Isn't a single memory range per vnode rather limiting? Physical
> machines frequently have at least one node with two ranges to
> accommodate the hole below 4Gb. And since the interface is for
> all guest kinds I'm afraid this will harm setting up guests rather
> sooner than later. (I'm sorry for thinking of this only now.)
> 
This actually was one concern of mine too, during the early stage of
Elena's work. I clearly remember wondering, and even asking, considering
that Linux uses more ranges per node, why using only one was ok for us.

Unfortunately, I OTOH forgot how that discussion ended and how we got to
this point, without me or anyone else (continuing to) complaining... So,
sorry from me too. :-(

Let's see what Elena and others think and, after that, how big of a
piece of work is to introduce support for more than one ranges (or
re-introduce, as I also think I remember this to be some kind of list,
during early versions of the series).

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls
  2014-08-22 13:54     ` Dario Faggioli
@ 2014-08-22 17:38       ` Elena Ufimtseva
  2014-08-23  4:51         ` Elena Ufimtseva
  0 siblings, 1 reply; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-22 17:38 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Li Yechen, George Dunlap, Matt Wilson,
	Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich

On Fri, Aug 22, 2014 at 9:54 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On ven, 2014-08-22 at 14:17 +0100, Jan Beulich wrote:
>
>> > +    if ( ret )
>> > +        return ret;
>> > +
>> > +    ret = -EFAULT;
>> > +
>> > +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance,
>> > +                         nr_vnodes * nr_vnodes) )
>> > +        goto vnuma_fail;
>> > +
>> > +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
>> > +        goto vnuma_fail;
>>
>> Isn't a single memory range per vnode rather limiting? Physical
>> machines frequently have at least one node with two ranges to
>> accommodate the hole below 4Gb. And since the interface is for
>> all guest kinds I'm afraid this will harm setting up guests rather
>> sooner than later. (I'm sorry for thinking of this only now.)
>>
> This actually was one concern of mine too, during the early stage of
> Elena's work. I clearly remember wondering, and even asking, considering
> that Linux uses more ranges per node, why using only one was ok for us.
>
> Unfortunately, I OTOH forgot how that discussion ended and how we got to
> this point, without me or anyone else (continuing to) complaining... So,
> sorry from me too. :-(
>
> Let's see what Elena and others think and, after that, how big of a
> piece of work is to introduce support for more than one ranges (or
> re-introduce, as I also think I remember this to be some kind of list,
> during early versions of the series).
>
> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>


Jan, Dario

Thank you for the review. I have briefly read the comments and will give
them a more thorough read after work.

As for multi-region nodes, I will give it a thought, as there may be a
simple extension to the memory structure.

I will try to evaluate asap and reply.

Elena
-- 
Elena

* Re: [PATCH RESEND v7 1/9] xen: vnuma topology and subop hypercalls
  2014-08-22 17:38       ` Elena Ufimtseva
@ 2014-08-23  4:51         ` Elena Ufimtseva
  0 siblings, 0 replies; 10+ messages in thread
From: Elena Ufimtseva @ 2014-08-23  4:51 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Li Yechen, George Dunlap, Matt Wilson,
	Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich

On Fri, Aug 22, 2014 at 1:38 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> On Fri, Aug 22, 2014 at 9:54 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> On ven, 2014-08-22 at 14:17 +0100, Jan Beulich wrote:
>>
>>> > +    if ( ret )
>>> > +        return ret;
>>> > +
>>> > +    ret = -EFAULT;
>>> > +
>>> > +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance,
>>> > +                         nr_vnodes * nr_vnodes) )
>>> > +        goto vnuma_fail;
>>> > +
>>> > +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
>>> > +        goto vnuma_fail;
>>>
>>> Isn't a single memory range per vnode rather limiting? Physical
>>> machines frequently have at least one node with two ranges to
>>> accommodate the hole below 4Gb. And since the interface is for
>>> all guest kinds I'm afraid this will harm setting up guests rather
>>> sooner than later. (I'm sorry for thinking of this only now.)
>>>
>> This actually was one concern of mine too, during the early stage of
>> Elena's work. I clearly remember wondering, and even asking, considering
>> that Linux uses more ranges per node, why using only one was ok for us.
>>
>> Unfortunately, I OTOH forgot how that discussion ended and how we got to
>> this point, without me or anyone else (continuing to) complaining... So,
>> sorry from me too. :-(
>>
>> Let's see what Elena and others think and, after that, how big of a
>> piece of work is to introduce support for more than one ranges (or
>> re-introduce, as I also think I remember this to be some kind of list,
>> during early versions of the series).
>>
>> Regards,
>> Dario
>>
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>
>
>
> Jan, Dario
>
> Thank you for the review. I have briefly read the comments and will give
> them a more thorough read after work.
>
> As for multi-region nodes, I will give it a thought, as there may be a
> simple extension to the memory structure.
>
> I will try to evaluate asap and reply.
>
> Elena
> --
> Elena


Dario

Can you please take a look at the libxl placement part and check whether
you have any suggestions?
That will probably change in later versions; I just wanted to run it by
you for the case I have now.
You are going on vacation next week, right?

I will try to work over the weekend and see how I can extend the interface
with minimum invasion to accommodate multi-region nodes, but the placement
part is unlikely to change.

Thank you!

-- 
Elena
