* [PATCH v6 00/10] vnuma introduction
@ 2014-07-18  5:49 Elena Ufimtseva
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
                   ` (12 more replies)
  0 siblings, 13 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:49 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

vNUMA introduction

This series of patches introduces vNUMA topology awareness and
provides the interfaces and data structures needed to enable vNUMA for
PV guests. There is a plan to extend this support to dom0 and
HVM domains.

vNUMA topology support must also be present in the PV guest kernel;
the corresponding Linux patches (linked below) need to be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines. vNUMA-enabled guests may also run on non-NUMA
machines and still have a virtual NUMA topology visible to them.
The Xen vNUMA implementation provides a way to run vNUMA-enabled guests on both
NUMA and UMA hosts and to flexibly map the vNUMA topology to the physical NUMA
topology.

Mapping to the physical NUMA topology may be done either manually or automatically.
By default, every PV domain has one vNUMA node. It is populated with default
parameters and does not affect performance. To have the vNUMA topology initialized
automatically, the configuration file only needs to specify the number of vNUMA
nodes; any vNUMA topology parameters left undefined are initialized to default
values.

vNUMA topology is currently defined by the following set of parameters:
    number of vNUMA nodes;
    distance table;
    vnode memory sizes;
    vcpu-to-vnode mapping;
    vnode-to-pnode mapping (for NUMA machines).
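
In the hypervisor these parameters end up in the per-domain structure added by
patch 01 ("xen: vnuma topology and subop hypercalls"); reproduced here for
reference, with comments added:

/* Per-domain vNUMA topology (xen/include/xen/domain.h, patch 01). */
struct vnuma_info {
    unsigned int nr_vnodes;         /* number of vNUMA nodes           */
    unsigned int *vdistance;        /* nr_vnodes x nr_vnodes distances */
    unsigned int *vcpu_to_vnode;    /* one entry per vcpu              */
    unsigned int *vnode_to_pnode;   /* one entry per vnode             */
    struct vmemrange *vmemrange;    /* memory range of each vnode      */
};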

This set of patches introduces two hypercall subops: XEN_DOMCTL_setvnumainfo
and XENMEM_get_vnumainfo.

    XEN_DOMCTL_setvnumainfo is used by the toolstack to populate the domain's
vNUMA topology, either from a user-defined configuration or from default
parameters. vNUMA is defined for every PV domain; if no vNUMA configuration is
found, a single vNUMA node is initialized, all vcpus are assigned to it, and
all other parameters are set to their default values.
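
As an illustration only (not part of the patches themselves), the toolstack-side
call added later in the series (patch 04, xc_domain_setvnuma) could be used
roughly as follows for a two-node guest; all values are placeholders:

/*
 * Illustrative sketch: populate a 2-vnode/2-vcpu topology through
 * xc_domain_setvnuma() (added by patch 04).  Values are placeholders.
 */
#include <xenctrl.h>

static int set_vnuma_example(xc_interface *xch, uint32_t domid)
{
    vmemrange_t ranges[2] = {
        { .start = 0,             .end = 2000ULL << 20 },  /* vnode 0: 2000 MB */
        { .start = 2000ULL << 20, .end = 4000ULL << 20 },  /* vnode 1: 2000 MB */
    };
    unsigned int distance[2 * 2]   = { 10, 20, 20, 10 };   /* 2x2 distance matrix */
    unsigned int vcpu_to_vnode[2]  = { 0, 1 };             /* vcpu N -> vnode N */
    unsigned int vnode_to_pnode[2] = { 0, 1 };             /* honoured only w/o autoplacement */

    return xc_domain_setvnuma(xch, domid, 2 /* vnodes */, 2 /* vcpus */,
                              ranges, distance, vcpu_to_vnode, vnode_to_pnode);
}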

    XENMEM_get_vnumainfo is used by the PV domain to obtain the vNUMA topology
from the hypervisor. The guest reports the sizes of the buffers it has allocated
for the various vNUMA parameters, and the hypervisor fills them with the topology.
Further work is required in the toolstack and in the hypervisor to allow HVM
guests to use these hypercalls.
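
For illustration, a PV Linux guest with the corresponding kernel patches applied
might issue the subop along the following lines. This is a sketch, not the actual
Linux patch; the struct layout and subop number come from the public headers added
by patch 01, while the kernel-side helper names are the usual Linux/Xen ones:

/*
 * Guest-side sketch (not the actual Linux patch).  Buffers are allocated by
 * the guest and sized by its own nr_vnodes/nr_vcpus; the hypervisor fills
 * them and copies the real nr_vnodes back.  Returns -ENOBUFS if the buffers
 * are too small.
 */
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

static int example_get_vnuma_info(unsigned int nr_vnodes, unsigned int nr_vcpus,
                                  struct vmemrange *ranges, unsigned int *distance,
                                  unsigned int *vcpu_to_vnode)
{
    struct vnuma_topology_info topo = {
        .domid     = DOMID_SELF,
        .nr_vnodes = nr_vnodes,
        .nr_vcpus  = nr_vcpus,
    };

    set_xen_guest_handle(topo.vmemrange.h, ranges);
    set_xen_guest_handle(topo.vdistance.h, distance);
    set_xen_guest_handle(topo.vcpu_to_vnode.h, vcpu_to_vnode);

    return HYPERVISOR_memory_op(XENMEM_get_vnumainfo, &topo);
}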

libxl

libxl allows the vNUMA topology to be defined in the domain configuration file and
verifies that the configuration is correct. libxl also verifies the vnode-to-pnode
mapping and uses it on NUMA machines when automatic placement is disabled. In case
of an incorrect or insufficient configuration, a single vNUMA node is initialized
and populated with default values.

libxc

libxc builds the vnode memory ranges for the guest and applies the necessary
alignment to the addresses, taking the guest e820 memory map into account.
When the domain memory is allocated, the vnode-to-pnode mapping is used to
determine the target physical node for each vnode. If this mapping is not defined,
the host is not a NUMA machine, or automatic NUMA placement is enabled, the
default non-node-specific allocation is used.
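
A minimal sketch of that idea (not the exact libxc code from this series):
translate each page address being populated into a vnode via the vnode memory
ranges, then into a pnode, and request memory from that physical node.
XENMEMF_exact_node() is the standard Xen allocation flag; VNODE_UNASSIGNED is a
hypothetical sentinel used only in this example:

#include <xenctrl.h>

#define VNODE_UNASSIGNED (~0U)   /* hypothetical "no pnode assigned" sentinel */

static unsigned int memflags_for_pfn(xen_pfn_t pfn, unsigned int nr_vnodes,
                                     const vmemrange_t *vmemrange,
                                     const unsigned int *vnode_to_pnode)
{
    uint64_t addr = (uint64_t)pfn << XC_PAGE_SHIFT;
    unsigned int i;

    for ( i = 0; i < nr_vnodes; i++ )
        if ( addr >= vmemrange[i].start && addr < vmemrange[i].end &&
             vnode_to_pnode[i] != VNODE_UNASSIGNED )
            return XENMEMF_exact_node(vnode_to_pnode[i]);

    /* No usable mapping (UMA host or autoplacement): default allocation. */
    return 0;
}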

hypervisor vNUMA initialization

PV guest

As of now, only PV guests can take advantage of the vNUMA functionality.
Such a guest allocates memory for the NUMA topology structures and sets the
number of nodes and vcpus, so the hypervisor knows how much memory the guest
has preallocated for the vNUMA topology. The guest then issues the
XENMEM_get_vnumainfo subop hypercall.
If for some reason the vNUMA topology cannot be initialized, a Linux guest
will have only one NUMA node initialized (standard Linux behavior).
To enable this, the vNUMA Linux patches must be applied to the PV guest
kernel.

The Linux kernel patches are available here:
https://git.gitorious.org/vnuma/linux_vnuma.git
git://gitorious.org/vnuma/linux_vnuma.git

Automatic vNUMA placement

Placement of vNUMA nodes on physical nodes is decided automatically if automatic
NUMA placement is enabled, or if it is disabled but the vnode-to-pnode mapping is
incorrect. If the vnode-to-pnode mapping is correct and automatic NUMA placement
is disabled, the vNUMA nodes are allocated on the physical nodes specified in the
guest config file.
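
The decision can be summarized by the sketch below (illustrative only; the names
do not correspond to actual libxl functions):

#include <stdbool.h>

/* Illustrative decision logic only, not actual libxl code. */
static bool use_config_vnode_to_pnode(bool numa_autoplacement_enabled,
                                      bool vnode_to_pnode_mapping_valid)
{
    /*
     * The vnode-to-pnode mapping from the config file is honoured only when
     * automatic NUMA placement is disabled and the mapping is valid;
     * otherwise vnode placement is decided automatically.
     */
    return !numa_autoplacement_enabled && vnode_to_pnode_mapping_valid;
}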

Xen patchset is available here:
https://git.gitorious.org/vnuma/xen_vnuma.git
git://gitorious.org/vnuma/xen_vnuma.git


Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

memory = 4000
vcpus = 2
# The name of the domain, change this if you want more than 1 VM.
name = "null"
vnodes = 2
#vnumamem = [3000, 1000]
#vnumamem = [4000,0]
vdistance = [10, 20]
vnuma_vcpumap = [1, 0]
vnuma_vnodemap = [1]
vnuma_autoplacement = 0
#e820_host = 1
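
The guest is created from this configuration in the usual way (e.g. "xl create
<cfgfile>"); the resulting guest boot output, numactl/numastat output from within
the guest, and xl debug-keys output from dom0 are shown below.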

[    0.000000] Linux version 3.15.0-rc8+ (assert@superpipe) (gcc version 4.7.2 (Debian 4.7.2-5) ) #43 SMP Fri Jun 27 01:23:11 EDT 2014
[    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
[    0.000000] ACPI in unprivileged domain disabled
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0xf9e00000-0xf9ffffff]
[    0.000000]  [mem 0xf9e00000-0xf9ffffff] page 4k
[    0.000000] BRK [0x019c8000, 0x019c8fff] PGTABLE
[    0.000000] BRK [0x019c9000, 0x019c9fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0xf8000000-0xf9dfffff]
[    0.000000]  [mem 0xf8000000-0xf9dfffff] page 4k
[    0.000000] BRK [0x019ca000, 0x019cafff] PGTABLE
[    0.000000] BRK [0x019cb000, 0x019cbfff] PGTABLE
[    0.000000] BRK [0x019cc000, 0x019ccfff] PGTABLE
[    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x80000000-0xf7ffffff]
[    0.000000]  [mem 0x80000000-0xf7ffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0x7fffffff]
[    0.000000]  [mem 0x00100000-0x7fffffff] page 4k
[    0.000000] RAMDISK: [mem 0x01dd8000-0x035c5fff]
[    0.000000] Nodes received = 2
[    0.000000] NUMA: Initialized distance table, cnt=2
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 1 [mem 0x7d000000-0xf9ffffff]
[    0.000000]   NODE_DATA [mem 0xf9828000-0xf984efff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x7cffffff]
[    0.000000]   node   1: [mem 0x7d000000-0xf9ffffff]
[    0.000000] On node 0 totalpages: 511903
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 7936 pages used for memmap
[    0.000000]   DMA32 zone: 507904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 512000
[    0.000000]   DMA32 zone: 8000 pages used for memmap
[    0.000000]   DMA32 zone: 512000 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.5-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:2 nr_node_ids:2
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007ac00000 s85888 r8192 d20608 u2097152
[    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1
[    0.000000] xen: PV spinlocks enabled
[    0.000000] Built 2 zonelists in Node order, mobility grouping on.  Total pages: 1007882
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug
[    0.000000] Memory: 3978224K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 117388K reserved)
[    0.000000] Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    0.000000] installing Xen timer for CPU 0
[    0.000000] tsc: Detected 2394.276 MHz processor
[    0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4788.55 BogoMIPS (lpj=9577104)
[    0.004000] pid_max: default: 32768 minimum: 301
[    0.004179] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.006782] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.007216] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.007288] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.007935] CPU: Physical Processor ID: 0
[    0.007942] CPU: Processor Core ID: 0
[    0.007951] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
[    0.007951] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
[    0.007951] tlb_flushall_shift: 6
[    0.021249] cpu 0 spinlock event irq 17
[    0.021292] Performance Events: unsupported p6 CPU model 45 no PMU driver, software events only.
[    0.022162] NMI watchdog: disabled (cpu0): hardware events not enabled
[    0.022625] installing Xen timer for CPU 1

root@heatpipe:~# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 1933 MB
node 0 free: 1894 MB
node 1 cpus: 1
node 1 size: 1951 MB
node 1 free: 1926 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

root@heatpipe:~# numastat
                           node0           node1
numa_hit                   52257           92679
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit              4254            4238
local_node                 52150           87364
other_node                   107            5315

root@superpipe:~# xl debug-keys u

(XEN) Domain 7 (total: 1024000):
(XEN)     Node 0: 1024000
(XEN)     Node 1: 0
(XEN)     Domain has 2 vnodes, 2 vcpus
(XEN)         vnode 0 - pnode 0, 2000 MB, vcpu nums: 0
(XEN)         vnode 1 - pnode 0, 2000 MB, vcpu nums: 1


memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null1"
vnodes = 8
#vnumamem = [3000, 1000]
vdistance = [10, 40]
#vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 1, 0, 0, 1, 1]
vnuma_autoplacement = 1
e820_host = 1

[    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
[    0.000000] 1-1 mapping on ac228->100000
[    0.000000] Released 318936 pages of unused memory
[    0.000000] Set 343512 page(s) to 1-1 mapping
[    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
[    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
[    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
[    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
[    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
[    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
[    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b5fff] unusable
[    0.000000] Xen: [mem 0x00000000ac6b6000-0x00000000ac7fafff] ACPI NVS
[    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
[    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
[    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
[    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
[    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
[    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
[    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
[    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
[    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
[    0.000000] BRK [0x019ce000, 0x019cefff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
[    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
[    0.000000] BRK [0x019cf000, 0x019cffff] PGTABLE
[    0.000000] BRK [0x019d0000, 0x019d0fff] PGTABLE
[    0.000000] BRK [0x019d1000, 0x019d1fff] PGTABLE
[    0.000000] BRK [0x019d2000, 0x019d2fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
[    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
[    0.000000]  [mem 0x00100000-0xac227fff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dd8000-0x0347ffff]
[    0.000000] Nodes received = 8
[    0.000000] NUMA: Initialized distance table, cnt=8
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x1f3fffff]
[    0.000000]   NODE_DATA [mem 0x1f3d9000-0x1f3fffff]
[    0.000000] Initmem setup node 1 [mem 0x1f800000-0x3e7fffff]
[    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
[    0.000000] Initmem setup node 2 [mem 0x3e800000-0x5dbfffff]
[    0.000000]   NODE_DATA [mem 0x5dbd9000-0x5dbfffff]
[    0.000000] Initmem setup node 3 [mem 0x5e000000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 4 [mem 0x7d000000-0x9c3fffff]
[    0.000000]   NODE_DATA [mem 0x9c3d9000-0x9c3fffff]
[    0.000000] Initmem setup node 5 [mem 0x9c800000-0x10f5d7fff]
[    0.000000]   NODE_DATA [mem 0x10f5b1000-0x10f5d7fff]
[    0.000000] Initmem setup node 6 [mem 0x10f800000-0x12e9d7fff]
[    0.000000]   NODE_DATA [mem 0x12e9b1000-0x12e9d7fff]
[    0.000000] Initmem setup node 7 [mem 0x12f000000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x1f3fffff]
[    0.000000]   node   1: [mem 0x1f400000-0x3e7fffff]
[    0.000000]   node   2: [mem 0x3e800000-0x5dbfffff]
[    0.000000]   node   3: [mem 0x5dc00000-0x7cffffff]
[    0.000000]   node   4: [mem 0x7d000000-0x9c3fffff]
[    0.000000]   node   5: [mem 0x9c400000-0xac227fff]
[    0.000000]   node   5: [mem 0x100000000-0x10f5d7fff]
[    0.000000]   node   6: [mem 0x10f5d8000-0x12e9d7fff]
[    0.000000]   node   7: [mem 0x12e9d8000-0x14ddd7fff]
[    0.000000] On node 0 totalpages: 127903
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 1936 pages used for memmap
[    0.000000]   DMA32 zone: 123904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 4 totalpages: 128000
[    0.000000]   DMA32 zone: 2000 pages used for memmap
[    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
[    0.000000] On node 5 totalpages: 128000
[    0.000000]   DMA32 zone: 1017 pages used for memmap
[    0.000000]   DMA32 zone: 65064 pages, LIFO batch:15
[    0.000000]   Normal zone: 984 pages used for memmap
[    0.000000]   Normal zone: 62936 pages, LIFO batch:15
[    0.000000] On node 6 totalpages: 128000
[    0.000000]   Normal zone: 2000 pages used for memmap
[    0.000000]   Normal zone: 128000 pages, LIFO batch:31
[    0.000000] On node 7 totalpages: 128000
[    0.000000]   Normal zone: 2000 pages used for memmap
[    0.000000]   Normal zone: 128000 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] PM: Registered nosave memory: [mem 0xac228000-0xac26bfff]
[    0.000000] PM: Registered nosave memory: [mem 0xac26c000-0xac57ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xac580000-0xac5a0fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5a1000-0xac5bbfff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5bc000-0xac5bdfff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5be000-0xac5befff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5bf000-0xac5cafff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5cb000-0xac5d9fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5da000-0xac5fafff]
[    0.000000] PM: Registered nosave memory: [mem 0xac5fb000-0xac6b5fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac6b6000-0xac7fafff]
[    0.000000] PM: Registered nosave memory: [mem 0xac7fb000-0xac80efff]
[    0.000000] PM: Registered nosave memory: [mem 0xac80f000-0xac80ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xac810000-0xac810fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac811000-0xac812fff]
[    0.000000] PM: Registered nosave memory: [mem 0xac813000-0xad7fffff]
[    0.000000] PM: Registered nosave memory: [mem 0xad800000-0xafffffff]
[    0.000000] PM: Registered nosave memory: [mem 0xb0000000-0xb3ffffff]
[    0.000000] PM: Registered nosave memory: [mem 0xb4000000-0xfed1ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed20000-0xfed3ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed40000-0xfed4ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed50000-0xfed8ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfed90000-0xfedfffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfee00000-0xfeefffff]
[    0.000000] PM: Registered nosave memory: [mem 0xfef00000-0xff9fffff]
[    0.000000] PM: Registered nosave memory: [mem 0xffa00000-0xffa3ffff]
[    0.000000] PM: Registered nosave memory: [mem 0xffa40000-0xffffffff]
[    0.000000] e820: [mem 0xb4000000-0xfed1ffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.5-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:8 nr_node_ids:8
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88001e800000 s85888 r8192 d20608 u2097152
[    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 [4] 4 [5] 5 [6] 6 [7] 7
[    0.000000] xen: PV spinlocks enabled
[    0.000000] Built 8 zonelists in Node order, mobility grouping on.  Total pages: 1007881
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: root=/dev/xvda1 ro console=hvc0 debug  kgdboc=hvc0 nokgdbroundup  initcall_debug debug
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Memory: 3976748K/4095612K available (4022K kernel code, 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 118864K reserved)

root@heatpipe:~# numactl --ha
maxn: 7
available: 8 nodes (0-7)
node 0 cpus: 0
node 0 size: 458 MB
node 0 free: 424 MB
node 1 cpus: 1
node 1 size: 491 MB
node 1 free: 481 MB
node 2 cpus: 2
node 2 size: 491 MB
node 2 free: 482 MB
node 3 cpus: 3
node 3 size: 491 MB
node 3 free: 485 MB
node 4 cpus: 4
node 4 size: 491 MB
node 4 free: 485 MB
node 5 cpus: 5
node 5 size: 491 MB
node 5 free: 484 MB
node 6 cpus: 6
node 6 size: 491 MB
node 6 free: 486 MB
node 7 cpus: 7
node 7 size: 476 MB
node 7 free: 471 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  40  40  40  40  40  40  40
  1:  40  10  40  40  40  40  40  40
  2:  40  40  10  40  40  40  40  40
  3:  40  40  40  10  40  40  40  40
  4:  40  40  40  40  10  40  40  40
  5:  40  40  40  40  40  10  40  40
  6:  40  40  40  40  40  40  10  40
  7:  40  40  40  40  40  40  40  10

root@heatpipe:~# numastat
                           node0           node1           node2           node3
numa_hit                  182203           14574           23800           17017
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              1016            1010            1051            1030
local_node                180995           12906           23272           15338
other_node                  1208            1668             528            1679

                           node4           node5           node6           node7
numa_hit                   10621           15346            3529            3863
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              1026            1017            1031            1029
local_node                  8941           13680            1855            2184
other_node                  1680            1666            1674            1679

root@superpipe:~# xl debug-keys u

(XEN) Domain 6 (total: 1024000):
(XEN)     Node 0: 321064
(XEN)     Node 1: 702936
(XEN)     Domain has 8 vnodes, 8 vcpus
(XEN)         vnode 0 - pnode 1, 500 MB, vcpu nums: 0
(XEN)         vnode 1 - pnode 0, 500 MB, vcpu nums: 1
(XEN)         vnode 2 - pnode 1, 500 MB, vcpu nums: 2
(XEN)         vnode 3 - pnode 1, 500 MB, vcpu nums: 3
(XEN)         vnode 4 - pnode 0, 500 MB, vcpu nums: 4
(XEN)         vnode 5 - pnode 0, 1841 MB, vcpu nums: 5
(XEN)         vnode 6 - pnode 1, 500 MB, vcpu nums: 6
(XEN)         vnode 7 - pnode 1, 500 MB, vcpu nums: 7

Current problems:

Warning on CPU bringup on other node

    The vcpus in the guest which belong to different vNUMA nodes are configured
    to share the same L2 cache and are thus considered to be siblings, yet
    siblings are not expected to be on different nodes. The following WARNING
    is seen during boot:

[    0.022750] SMP alternatives: switching to SMP code
[    0.004000] ------------[ cut here ]------------
[    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303 topology_sane.isra.8+0x67/0x79()
[    0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.004000] Modules linked in:
[    0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
[    0.004000]  0000000000000000 0000000000000009 ffffffff813df458 ffff88007abe7e60
[    0.004000]  ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08 ffffffff00000100
[    0.004000]  0000000000000001 ffff8800f6e13900 0000000000000000 000000000000b018
[    0.004000] Call Trace:
[    0.004000]  [<ffffffff813df458>] ? dump_stack+0x41/0x51
[    0.004000]  [<ffffffff81048963>] ? warn_slowpath_common+0x78/0x90
[    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
[    0.004000]  [<ffffffff81048a13>] ? warn_slowpath_fmt+0x45/0x4a
[    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
[    0.004000]  [<ffffffff8102fd2e>] ? set_cpu_sibling_map+0x1c9/0x3f7
[    0.004000]  [<ffffffff81042146>] ? numa_add_cpu+0xa/0x18
[    0.004000]  [<ffffffff8100b4e2>] ? cpu_bringup+0x50/0x8f
[    0.004000]  [<ffffffff8100b544>] ? cpu_bringup_and_idle+0x1d/0x28
[    0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
[    0.035371] x86: Booted up 2 nodes, 2 CPUs

The workaround is to specify cpuid in the config file and not use SMT. But soon I will
come up with some other acceptable solution.

Incorrect amount of memory for nodes in debug-keys output

    Since the per-domain node ranges are stored as guest addresses, the memory
    calculated for some nodes is incorrect due to the guest e820 memory holes.

TODO:
    - some modifications to automatic vnuma placement may be needed;
    - vdistance extended configuration parser will need to be in place;
    - SMT siblings problem (see above) will need a solution;

Changes since v5:
    - reorganized patches;
    - modified domctl hypercall and added locking;
    - added XSM hypercalls with basic policies;
    - verify 32bit compatibility;

Elena Ufimtseva (10):
  xen: vnuma topology and subop hypercalls
  xsm bits for vNUMA hypercalls
  vnuma hook to debug-keys u
  libxc: Introduce xc_domain_setvnuma to set vNUMA
  libxl: vnuma topology configuration parser and doc
  libxc: move code to arch_boot_alloc func
  libxc: allocate domain memory for vnuma enabled domains
  libxl: build numa nodes memory blocks
  libxl: vnuma nodes placement bits
  libxl: set vnuma for domain

 docs/man/xl.cfg.pod.5               |   77 +++++++
 tools/libxc/xc_dom.h                |   11 +
 tools/libxc/xc_dom_x86.c            |   71 +++++-
 tools/libxc/xc_domain.c             |   63 ++++++
 tools/libxc/xenctrl.h               |    9 +
 tools/libxc/xg_private.h            |    1 +
 tools/libxl/libxl.c                 |   22 ++
 tools/libxl/libxl.h                 |   19 ++
 tools/libxl/libxl_create.c          |    1 +
 tools/libxl/libxl_dom.c             |  148 ++++++++++++
 tools/libxl/libxl_internal.h        |   12 +
 tools/libxl/libxl_numa.c            |  193 ++++++++++++++++
 tools/libxl/libxl_types.idl         |    6 +-
 tools/libxl/libxl_vnuma.h           |    8 +
 tools/libxl/libxl_x86.c             |    3 +-
 tools/libxl/xl_cmdimpl.c            |  425 +++++++++++++++++++++++++++++++++++
 xen/arch/x86/numa.c                 |   29 ++-
 xen/common/domain.c                 |   13 ++
 xen/common/domctl.c                 |  167 ++++++++++++++
 xen/common/memory.c                 |   69 ++++++
 xen/include/public/domctl.h         |   29 +++
 xen/include/public/memory.h         |   47 +++-
 xen/include/xen/domain.h            |   11 +
 xen/include/xen/sched.h             |    1 +
 xen/include/xsm/dummy.h             |    6 +
 xen/include/xsm/xsm.h               |    7 +
 xen/xsm/dummy.c                     |    1 +
 xen/xsm/flask/hooks.c               |   10 +
 xen/xsm/flask/policy/access_vectors |    4 +
 29 files changed, 1447 insertions(+), 16 deletions(-)
 create mode 100644 tools/libxl/libxl_vnuma.h

-- 
1.7.10.4


* [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18 10:30   ` Wei Liu
                     ` (3 more replies)
  2014-07-18  5:50 ` [PATCH v6 02/10] xsm bits for vNUMA hypercalls Elena Ufimtseva
                   ` (11 subsequent siblings)
  12 siblings, 4 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Define the interface, structures and hypercalls for the toolstack to
build the vnuma topology and for guests that wish to retrieve it.
Two subop hypercalls are introduced by this patch:
XEN_DOMCTL_setvnumainfo to define the vNUMA topology per domain,
and XENMEM_get_vnumainfo for the guest to retrieve that topology.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/common/domain.c         |   13 ++++
 xen/common/domctl.c         |  167 +++++++++++++++++++++++++++++++++++++++++++
 xen/common/memory.c         |   62 ++++++++++++++++
 xen/include/public/domctl.h |   29 ++++++++
 xen/include/public/memory.h |   47 +++++++++++-
 xen/include/xen/domain.h    |   11 +++
 xen/include/xen/sched.h     |    1 +
 7 files changed, 329 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index cd64aea..895584a 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -584,6 +584,18 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
     return 0;
 }
 
+void vnuma_destroy(struct vnuma_info *vnuma)
+{
+    if ( vnuma )
+    {
+        xfree(vnuma->vmemrange);
+        xfree(vnuma->vcpu_to_vnode);
+        xfree(vnuma->vdistance);
+        xfree(vnuma->vnode_to_pnode);
+        xfree(vnuma);
+    }
+}
+
 int domain_kill(struct domain *d)
 {
     int rc = 0;
@@ -602,6 +614,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem_client);
+        vnuma_destroy(d->vnuma);
         domain_set_outstanding_pages(d, 0);
         d->tmem_client = NULL;
         /* fallthrough */
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index c326aba..7464284 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -297,6 +297,144 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
             guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
 }
 
+/*
+ * Allocates memory for vNUMA, **vnuma should be NULL.
+ * Caller has to make sure that domain has max_pages
+ * and number of vcpus set for domain.
+ * Verifies that single allocation does not exceed
+ * PAGE_SIZE.
+ */
+static int vnuma_alloc(struct vnuma_info **vnuma,
+                       unsigned int nr_vnodes,
+                       unsigned int nr_vcpus,
+                       unsigned int dist_size)
+{
+    struct vnuma_info *v;
+
+    if ( vnuma && *vnuma )
+        return -EINVAL;
+
+    v = *vnuma;
+    /*
+     * Check if any of the xmallocs would exceed PAGE_SIZE.
+     * If yes, consider it as an error for now.
+     */
+    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
+        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
+        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
+        dist_size > PAGE_SIZE / sizeof(dist_size) )
+        return -EINVAL;
+
+    v = xzalloc(struct vnuma_info);
+    if ( !v )
+        return -ENOMEM;
+
+    v->vdistance = xmalloc_array(unsigned int, dist_size);
+    v->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
+    v->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
+    v->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
+
+    if ( v->vdistance == NULL || v->vmemrange == NULL ||
+        v->vcpu_to_vnode == NULL || v->vnode_to_pnode == NULL )
+    {
+        vnuma_destroy(v);
+        return -ENOMEM;
+    }
+
+    *vnuma = v;
+
+    return 0;
+}
+
+/*
+ * Allocate memory and construct one vNUMA node,
+ * set default parameters, assign all memory and
+ * vcpus to this node, set distance to 10.
+ */
+static long vnuma_fallback(const struct domain *d,
+                          struct vnuma_info **vnuma)
+{
+    struct vnuma_info *v;
+    long ret;
+
+
+    /* Will not destroy vNUMA here, destroy before calling this. */
+    if ( vnuma && *vnuma )
+        return -EINVAL;
+
+    v = *vnuma;
+    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
+    if ( ret )
+        return ret;
+
+    v->vmemrange[0].start = 0;
+    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;
+    v->vdistance[0] = 10;
+    v->vnode_to_pnode[0] = NUMA_NO_NODE;
+    memset(v->vcpu_to_vnode, 0, d->max_vcpus);
+    v->nr_vnodes = 1;
+
+    *vnuma = v;
+
+    return 0;
+}
+
+/*
+ * Construct the vNUMA topology from the u_vnuma struct and return
+ * it in dst.
+ */
+long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
+                const struct domain *d,
+                struct vnuma_info **dst)
+{
+    unsigned int dist_size, nr_vnodes = 0;
+    long ret;
+    struct vnuma_info *v = NULL;
+
+    ret = -EINVAL;
+
+    /* If vNUMA topology already set, just exit. */
+    if ( !u_vnuma || *dst )
+        return ret;
+
+    nr_vnodes = u_vnuma->nr_vnodes;
+
+    if ( nr_vnodes == 0 )
+        return ret;
+
+    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
+        return ret;
+
+    dist_size = nr_vnodes * nr_vnodes;
+
+    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
+    if ( ret )
+        return ret;
+
+    /* On copy failure, fall back to a single vNUMA node and report success. */
+    ret = 0;
+
+    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
+        goto vnuma_onenode;
+    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
+        goto vnuma_onenode;
+    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
+        d->max_vcpus) )
+        goto vnuma_onenode;
+    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
+        nr_vnodes) )
+        goto vnuma_onenode;
+
+    v->nr_vnodes = nr_vnodes;
+    *dst = v;
+
+    return ret;
+
+vnuma_onenode:
+    vnuma_destroy(v);
+    return vnuma_fallback(d, dst);
+}
+
 long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 {
     long ret = 0;
@@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
     }
     break;
 
+    case XEN_DOMCTL_setvnumainfo:
+    {
+        struct vnuma_info *v = NULL;
+
+        ret = -EFAULT;
+        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
+            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
+            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
+            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
+            return ret;
+
+        ret = -EINVAL;
+
+        ret = vnuma_init(&op->u.vnuma, d, &v);
+        if ( ret < 0 || v == NULL )
+            break;
+
+        /* Overwrite any existing vnuma for the domain. */
+        if ( d->vnuma )
+            vnuma_destroy(d->vnuma);
+
+        domain_lock(d);
+        d->vnuma = v;
+        domain_unlock(d);
+
+        ret = 0;
+    }
+    break;
+
     default:
         ret = arch_do_domctl(op, d, u_domctl);
         break;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index c2dd31b..925b9fc 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -969,6 +969,68 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         break;
 
+    case XENMEM_get_vnumainfo:
+    {
+        struct vnuma_topology_info topology;
+        struct domain *d;
+        unsigned int dom_vnodes = 0;
+
+        /*
+         * guest passes nr_vnodes and nr_vcpus thus
+         * we know how much memory guest has allocated.
+         */
+        if ( copy_from_guest(&topology, arg, 1) ||
+            guest_handle_is_null(topology.vmemrange.h) ||
+            guest_handle_is_null(topology.vdistance.h) ||
+            guest_handle_is_null(topology.vcpu_to_vnode.h) )
+            return -EFAULT;
+
+        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
+            return -ESRCH;
+
+        rc = -EOPNOTSUPP;
+        if ( d->vnuma == NULL )
+            goto vnumainfo_out;
+
+        if ( d->vnuma->nr_vnodes == 0 )
+            goto vnumainfo_out;
+
+        dom_vnodes = d->vnuma->nr_vnodes;
+
+        /*
+         * The guest's nr_vcpus and nr_vnodes may differ from the domain's vnuma config.
+         * Check them here to make sure we don't overflow the guest's buffers.
+         */
+        rc = -ENOBUFS;
+        if ( topology.nr_vnodes < dom_vnodes ||
+            topology.nr_vcpus < d->max_vcpus )
+            goto vnumainfo_out;
+
+        rc = -EFAULT;
+
+        if ( copy_to_guest(topology.vmemrange.h, d->vnuma->vmemrange,
+                           dom_vnodes) != 0 )
+            goto vnumainfo_out;
+
+        if ( copy_to_guest(topology.vdistance.h, d->vnuma->vdistance,
+                           dom_vnodes * dom_vnodes) != 0 )
+            goto vnumainfo_out;
+
+        if ( copy_to_guest(topology.vcpu_to_vnode.h, d->vnuma->vcpu_to_vnode,
+                           d->max_vcpus) != 0 )
+            goto vnumainfo_out;
+
+        topology.nr_vnodes = dom_vnodes;
+
+        if ( copy_to_guest(arg, &topology, 1) != 0 )
+            goto vnumainfo_out;
+        rc = 0;
+
+ vnumainfo_out:
+        rcu_unlock_domain(d);
+        break;
+    }
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 5b11bbf..5ee74f4 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -35,6 +35,7 @@
 #include "xen.h"
 #include "grant_table.h"
 #include "hvm/save.h"
+#include "memory.h"
 
 #define XEN_DOMCTL_INTERFACE_VERSION 0x0000000a
 
@@ -934,6 +935,32 @@ struct xen_domctl_vcpu_msrs {
 };
 typedef struct xen_domctl_vcpu_msrs xen_domctl_vcpu_msrs_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpu_msrs_t);
+
+/*
+ * Use in XEN_DOMCTL_setvnumainfo to set
+ * vNUMA domain topology.
+ */
+struct xen_domctl_vnuma {
+    uint32_t nr_vnodes;
+    uint32_t _pad;
+    XEN_GUEST_HANDLE_64(uint) vdistance;
+    XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode;
+
+    /*
+     * vnode to physical NUMA node mask.
+     * This is kept on a per-domain basis for
+     * interested consumers, such as NUMA-aware ballooning.
+     */
+    XEN_GUEST_HANDLE_64(uint) vnode_to_pnode;
+
+    /*
+     * memory ranges for each vNUMA node
+     */
+    XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange;
+};
+typedef struct xen_domctl_vnuma xen_domctl_vnuma_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t);
+
 #endif
 
 struct xen_domctl {
@@ -1008,6 +1035,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_cacheflush                    71
 #define XEN_DOMCTL_get_vcpu_msrs                 72
 #define XEN_DOMCTL_set_vcpu_msrs                 73
+#define XEN_DOMCTL_setvnumainfo                  74
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1068,6 +1096,7 @@ struct xen_domctl {
         struct xen_domctl_cacheflush        cacheflush;
         struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
+        struct xen_domctl_vnuma             vnuma;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 2c57aa0..2c212e1 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -521,9 +521,54 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
  * The zero value is appropiate.
  */
 
+/* vNUMA node memory range */
+struct vmemrange {
+    uint64_t start, end;
+};
+
+typedef struct vmemrange vmemrange_t;
+DEFINE_XEN_GUEST_HANDLE(vmemrange_t);
+
+/*
+ * The vNUMA topology specifies the number of vNUMA nodes, the distance table,
+ * the memory ranges and the vcpu mapping provided to guests.
+ * The XENMEM_get_vnumainfo hypercall expects the guest to supply
+ * nr_vnodes and nr_vcpus to indicate the buffer space available. After
+ * filling the guest structures, nr_vnodes and nr_vcpus are copied
+ * back to the guest.
+ */
+struct vnuma_topology_info {
+    /* IN */
+    domid_t domid;
+    /* IN/OUT */
+    unsigned int nr_vnodes;
+    unsigned int nr_vcpus;
+    /* OUT */
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t pad;
+    } vdistance;
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t pad;
+    } vcpu_to_vnode;
+    union {
+        XEN_GUEST_HANDLE(vmemrange_t) h;
+        uint64_t pad;
+    } vmemrange;
+};
+typedef struct vnuma_topology_info vnuma_topology_info_t;
+DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
+
+/*
+ * XENMEM_get_vnumainfo is used by the guest to get
+ * the vNUMA topology from the hypervisor.
+ */
+#define XENMEM_get_vnumainfo               26
+
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
-/* Next available subop number is 26 */
+/* Next available subop number is 27 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index bb1c398..d29a84d 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -89,4 +89,15 @@ extern unsigned int xen_processor_pmbits;
 
 extern bool_t opt_dom0_vcpus_pin;
 
+/* vnuma topology per domain. */
+struct vnuma_info {
+    unsigned int nr_vnodes;
+    unsigned int *vdistance;
+    unsigned int *vcpu_to_vnode;
+    unsigned int *vnode_to_pnode;
+    struct vmemrange *vmemrange;
+};
+
+void vnuma_destroy(struct vnuma_info *vnuma);
+
 #endif /* __XEN_DOMAIN_H__ */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index d5bc461..71e4218 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -447,6 +447,7 @@ struct domain
     nodemask_t node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
+    struct vnuma_info *vnuma;
 };
 
 struct domain_setup_info
-- 
1.7.10.4


* [PATCH v6 02/10] xsm bits for vNUMA hypercalls
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18 13:50   ` Konrad Rzeszutek Wilk
  2014-07-18  5:50 ` [PATCH v6 03/10] vnuma hook to debug-keys u Elena Ufimtseva
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Define the xsm_get_vnumainfo hook, used for domains which
wish to receive the vnuma topology. Add an xsm hook for
XEN_DOMCTL_setvnumainfo. Also add basic policies.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/common/memory.c                 |    7 +++++++
 xen/include/xsm/dummy.h             |    6 ++++++
 xen/include/xsm/xsm.h               |    7 +++++++
 xen/xsm/dummy.c                     |    1 +
 xen/xsm/flask/hooks.c               |   10 ++++++++++
 xen/xsm/flask/policy/access_vectors |    4 ++++
 6 files changed, 35 insertions(+)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 925b9fc..9a87aa8 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -988,6 +988,13 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
             return -ESRCH;
 
+        rc = xsm_get_vnumainfo(XSM_PRIV, d);
+        if ( rc )
+        {
+            rcu_unlock_domain(d);
+            return rc;
+        }
+
         rc = -EOPNOTSUPP;
         if ( d->vnuma == NULL )
             goto vnumainfo_out;
diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
index c5aa316..4262fd8 100644
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -317,6 +317,12 @@ static XSM_INLINE int xsm_set_pod_target(XSM_DEFAULT_ARG struct domain *d)
     return xsm_default_action(action, current->domain, d);
 }
 
+static XSM_INLINE int xsm_get_vnumainfo(XSM_DEFAULT_ARG struct domain *d)
+{
+    XSM_ASSERT_ACTION(XSM_PRIV);
+    return xsm_default_action(action, current->domain, d);
+}
+
 #if defined(HAS_PASSTHROUGH) && defined(HAS_PCI)
 static XSM_INLINE int xsm_get_device_group(XSM_DEFAULT_ARG uint32_t machine_bdf)
 {
diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
index a85045d..c7ec562 100644
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -169,6 +169,7 @@ struct xsm_operations {
     int (*unbind_pt_irq) (struct domain *d, struct xen_domctl_bind_pt_irq *bind);
     int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
     int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
+    int (*get_vnumainfo) (struct domain *d);
 #endif
 };
 
@@ -653,6 +654,12 @@ static inline int xsm_ioport_mapping (xsm_default_t def, struct domain *d, uint3
 {
     return xsm_ops->ioport_mapping(d, s, e, allow);
 }
+
+static inline int xsm_get_vnumainfo (xsm_default_t def, struct domain *d)
+{
+    return xsm_ops->get_vnumainfo(d);
+}
+
 #endif /* CONFIG_X86 */
 
 #endif /* XSM_NO_WRAPPERS */
diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
index c95c803..0826a8b 100644
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -85,6 +85,7 @@ void xsm_fixup_ops (struct xsm_operations *ops)
     set_to_dummy_if_null(ops, iomem_permission);
     set_to_dummy_if_null(ops, iomem_mapping);
     set_to_dummy_if_null(ops, pci_config_permission);
+    set_to_dummy_if_null(ops, get_vnumainfo);
 
 #if defined(HAS_PASSTHROUGH) && defined(HAS_PCI)
     set_to_dummy_if_null(ops, get_device_group);
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index f2f59ea..00efba1 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -404,6 +404,11 @@ static int flask_claim_pages(struct domain *d)
     return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SETCLAIM);
 }
 
+static int flask_get_vnumainfo(struct domain *d)
+{
+    return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_VNUMAINFO);
+}
+
 static int flask_console_io(struct domain *d, int cmd)
 {
     u32 perm;
@@ -715,6 +720,9 @@ static int flask_domctl(struct domain *d, int cmd)
     case XEN_DOMCTL_cacheflush:
         return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__CACHEFLUSH);
 
+    case XEN_DOMCTL_setvnumainfo:
+        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SET_VNUMAINFO);
+
     default:
         printk("flask_domctl: Unknown op %d\n", cmd);
         return -EPERM;
@@ -1552,6 +1560,8 @@ static struct xsm_operations flask_ops = {
     .hvm_param_nested = flask_hvm_param_nested,
 
     .do_xsm_op = do_flask_op,
+    .get_vnumainfo = flask_get_vnumainfo,
+
 #ifdef CONFIG_COMPAT
     .do_compat_op = compat_flask_op,
 #endif
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 32371a9..d279841 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -200,6 +200,10 @@ class domain2
     cacheflush
 # Creation of the hardware domain when it is not dom0
     create_hardware_domain
+# XEN_DOMCTL_setvnumainfo
+    set_vnumainfo
+# XENMEM_get_vnumainfo
+    get_vnumainfo
 }
 
 # Similar to class domain, but primarily contains domctls related to HVM domains
-- 
1.7.10.4


* [PATCH v6 03/10] vnuma hook to debug-keys u
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
  2014-07-18  5:50 ` [PATCH v6 02/10] xsm bits for vNUMA hypercalls Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-23 14:10   ` Jan Beulich
  2014-07-18  5:50 ` [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA Elena Ufimtseva
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Add debug-keys hook to display vnuma topology.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/arch/x86/numa.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index b141877..8153ec7 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -347,7 +347,7 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
 	s_time_t now = NOW();
-	int i;
+	int i, j, n, err;
 	struct domain *d;
 	struct page_info *page;
 	unsigned int page_num_node[MAX_NUMNODES];
@@ -389,6 +389,33 @@ static void dump_numa(unsigned char key)
 
 		for_each_online_node(i)
 			printk("    Node %u: %u\n", i, page_num_node[i]);
+
+		if ( d->vnuma ) {
+			printk("    Domain has %u vnodes, %u vcpus\n", d->vnuma->nr_vnodes, d->max_vcpus);
+			for ( i = 0; i < d->vnuma->nr_vnodes; i++ ) {
+				err = snprintf(keyhandler_scratch, 12, "%u", d->vnuma->vnode_to_pnode[i]);
+				if ( err < 0 )
+					printk("        vnode %u - pnode %s,", i, "any");
+				else
+					printk("        vnode %u - pnode %s,", i,
+				d->vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ? "any" : keyhandler_scratch);
+				printk(" %"PRIu64" MB, ",
+					(d->vnuma->vmemrange[i].end - d->vnuma->vmemrange[i].start) >> 20);
+				printk("vcpu nums: ");
+				for ( j = 0, n = 0; j < d->max_vcpus; j++ ) {
+					if ( d->vnuma->vcpu_to_vnode[j] == i ) {
+						if ( ((n + 1) % 8) == 0 )
+							printk("%d\n", j);
+						else if ( !(n % 8) && n != 0 )
+							printk("%s%d ", "             ", j);
+						else
+							printk("%d ", j);
+					n++;
+					}
+				}
+				printk("\n");
+			}
+		}
 	}
 
 	rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4


* [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (2 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 03/10] vnuma hook to debug-keys u Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18 10:33   ` Wei Liu
  2014-07-29 10:33   ` Ian Campbell
  2014-07-18  5:50 ` [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc Elena Ufimtseva
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

With the introduction of XEN_DOMCTL_setvnumainfo
in the patch titled "xen: vnuma topology and subop hypercalls",
this patch puts in the plumbing to use it from the toolstack. The user
is allowed to call this multiple times if they wish.
It will error out if the nr_vnodes or nr_vcpus is zero.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/xc_domain.c |   63 +++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h   |    9 +++++++
 2 files changed, 72 insertions(+)

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 0230c6c..a5625b5 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -2123,6 +2123,69 @@ int xc_domain_set_max_evtchn(xc_interface *xch, uint32_t domid,
     return do_domctl(xch, &domctl);
 }
 
+/* Plumbing Xen with vNUMA topology */
+int xc_domain_setvnuma(xc_interface *xch,
+                        uint32_t domid,
+                        uint16_t nr_vnodes,
+                        uint16_t nr_vcpus,
+                        vmemrange_t *vmemrange,
+                        unsigned int *vdistance,
+                        unsigned int *vcpu_to_vnode,
+                        unsigned int *vnode_to_pnode)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
+                                    nr_vnodes * nr_vnodes,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) *
+                                    nr_vnodes,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    if ( nr_vnodes == 0 ) {
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ( !vdistance || !vcpu_to_vnode || !vmemrange || !vnode_to_pnode ) {
+        PERROR("%s: Cant set vnuma without initializing topology", __func__);
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ( xc_hypercall_bounce_pre(xch, vmemrange)      ||
+         xc_hypercall_bounce_pre(xch, vdistance)      ||
+         xc_hypercall_bounce_pre(xch, vcpu_to_vnode)  ||
+         xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) {
+        PERROR("%s: Could not bounce buffers!", __func__);
+        errno = EFAULT;
+        return -1;
+    }
+
+    set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange);
+    set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance);
+    set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode);
+    set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode);
+
+    domctl.cmd = XEN_DOMCTL_setvnumainfo;
+    domctl.domain = (domid_t)domid;
+    domctl.u.vnuma.nr_vnodes = nr_vnodes;
+
+    rc = do_domctl(xch, &domctl);
+
+    xc_hypercall_bounce_post(xch, vmemrange);
+    xc_hypercall_bounce_post(xch, vdistance);
+    xc_hypercall_bounce_post(xch, vcpu_to_vnode);
+    xc_hypercall_bounce_post(xch, vnode_to_pnode);
+
+    if ( rc )
+        errno = EFAULT;
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 3578b09..81f173b 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -1235,6 +1235,15 @@ int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
                                unsigned long map_limitkb);
 
+int xc_domain_setvnuma(xc_interface *xch,
+                        uint32_t domid,
+                        uint16_t nr_vnodes,
+                        uint16_t nr_vcpus,
+                        vmemrange_t *vmemrange,
+                        unsigned int *vdistance,
+                        unsigned int *vcpu_to_vnode,
+                        unsigned int *vnode_to_pnode);
+
 #if defined(__i386__) || defined(__x86_64__)
 /*
  * PC BIOS standard E820 types and structure.
-- 
1.7.10.4


* [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (3 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18 10:53   ` Wei Liu
                     ` (2 more replies)
  2014-07-18  5:50 ` [PATCH v6 06/10] libxc: move code to arch_boot_alloc func Elena Ufimtseva
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Parses the vnuma topology configuration: number of nodes, memory
sizes and the related maps. If it is not defined, initializes vnuma
with a single node and the default topology: that one node covers all
domain memory and all vcpus are assigned to it.
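
As an illustration only (not part of the patch; the values are arbitrary
and assume memory = 4096 and vcpus = 4), a config fragment using the
options documented below could look like:

    vnodes = 2
    vnuma_mem = [2048, 2048]
    vdistance = [10, 20]
    vnuma_vcpumap = [0, 0, 1, 1]
    vnuma_vnodemap = [0, 1]
    vnuma_autoplacement = 0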

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 docs/man/xl.cfg.pod.5       |   77 ++++++++
 tools/libxl/libxl_types.idl |    6 +-
 tools/libxl/libxl_vnuma.h   |    8 +
 tools/libxl/xl_cmdimpl.c    |  425 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 515 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxl/libxl_vnuma.h

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index ff9ea77..0c7fbf8 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -242,6 +242,83 @@ if the values of B<memory=> and B<maxmem=> differ.
 A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver
 it will crash.
 
+=item B<vnuma_nodes=N>
+
+Number of vNUMA nodes the guest will be initialized with on boot.
+A PV guest has one vNUMA node by default.
+
+=item B<vnuma_mem=[vmem1, vmem2, ...]>
+
+List of memory sizes for each node, in MBytes. The number of items listed must
+match the number of vnodes. Domain creation will fail if the sum of all vnode
+memory sizes does not match the domain memory or if any node is missing.
+If not specified, memory is split equally between vnodes. The current minimum
+memory size for one node is 32MB.
+
+Example: vnuma_mem=[1024, 1024, 2048, 2048]
+Total amount of memory in guest: 6GB
+
+=item B<vdistance=[d1, d2]>
+
+Defines the distance table for vNUMA nodes. NUMA topology distances are
+represented by a two-dimensional square matrix whose element [i,j] is the
+distance between nodes i and j. In the trivial case all diagonal elements are
+equal and the matrix is symmetrical. The vdistance option defines two values,
+d1 and d2: d1 is used for all diagonal elements of the distance matrix and
+all other elements are set to d2. Distances are usually multiples of 10 in
+Linux and the same rule is used here.
+If not specified, the default values [10, 20] are used for the distances.
+For a single node the default distance is [10].
+
+Examples:
+vnodes = 3
+vdistance=[10, 20]
+will create this distance table (this is also the default setting):
+[10, 20, 20]
+[20, 10, 20]
+[20, 20, 10]
+
+=item B<vnuma_vcpumap=[node_nr, node_nr, ...]>
+
+Defines the vcpu to vnode mapping as a list of integers. The position in the
+list is the vcpu number and the value is the vnode number to which that vcpu
+will be assigned.
+Current limitations:
+- every vNUMA node must have at least one vcpu, otherwise the default vcpu_to_vnode map is used;
+- the total number of vnodes cannot be bigger than the number of vcpus.
+
+Example:
+Map of 4 vcpus to 2 vnodes:
+0,1 vcpu -> vnode0
+2,3 vcpu -> vnode1:
+
+vnuma_vcpumap = [0, 0, 1, 1]
+ 4 vcpus here -  0  1  2  3
+
+=item B<vnuma_vnodemap=[p1, p2, ..., pn]>
+
+List of physical node numbers; the position in the list represents the vnode
+number. Used for manual placement of vNUMA nodes on physical NUMA nodes.
+It is ignored if automatic NUMA placement is active.
+
+Example:
+assume a NUMA machine with 4 physical nodes. Place vnuma node 0 on pnode 2
+and vnuma node 1 on pnode 3:
+vnode0 -> pnode2
+vnode1 -> pnode3
+
+vnuma_vnodemap=[2, 3]
+The first vnode will be placed on node 2 and the second on node 3.
+
+=item B<vnuma_autoplacement=[0|1]>
+
+If set to 1 and automatic NUMA placement is enabled, the best physical nodes to
+place the vnuma nodes on are chosen automatically and vnuma_vnodemap is ignored.
+Automatic NUMA placement is enabled if the domain has no pinned cpus.
+If vnuma_autoplacement is set to 0, the vnodes are placed on the NUMA nodes set
+in vnuma_vnodemap, provided there is enough memory on those physical nodes. If
+not, the allocation is made on any available nodes and may span multiple physical NUMA nodes.
+
 =back
 
 =head3 Event Actions
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index de25f42..5876822 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -318,7 +318,11 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("disable_migrate", libxl_defbool),
     ("cpuid",           libxl_cpuid_policy_list),
     ("blkdev_start",    string),
-    
+    ("vnuma_mem",     Array(uint64, "nr_nodes")),
+    ("vnuma_vcpumap",     Array(uint32, "nr_nodemap")),
+    ("vdistance",        Array(uint32, "nr_dist")),
+    ("vnuma_vnodemap",  Array(uint32, "nr_node_to_pnode")),
+    ("vnuma_autoplacement",  libxl_defbool),
     ("device_model_version", libxl_device_model_version),
     ("device_model_stubdomain", libxl_defbool),
     # if you set device_model you must set device_model_version too
diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
new file mode 100644
index 0000000..4ff4c57
--- /dev/null
+++ b/tools/libxl/libxl_vnuma.h
@@ -0,0 +1,8 @@
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#define VNUMA_NO_NODE ~((unsigned int)0)
+
+/* Min vNUMA node size, taken from Linux. */
+#define MIN_VNODE_SIZE  32U
+
+#define MAX_VNUMA_NODES ((unsigned int)1 << 10)
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 68df548..5d91c2c 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -40,6 +40,7 @@
 #include "libxl_json.h"
 #include "libxlutil.h"
 #include "xl.h"
+#include "libxl_vnuma.h"
 
 /* For calls which return an errno on failure */
 #define CHK_ERRNOVAL( call ) ({                                         \
@@ -690,6 +691,423 @@ static void parse_top_level_sdl_options(XLU_Config *config,
     xlu_cfg_replace_string (config, "xauthority", &sdl->xauthority, 0);
 }
 
+
+static unsigned int get_list_item_uint(XLU_ConfigList *list, unsigned int i)
+{
+    const char *buf;
+    char *ep;
+    unsigned long ul;
+    int rc = -EINVAL;
+
+    buf = xlu_cfg_get_listitem(list, i);
+    if (!buf)
+        return rc;
+    ul = strtoul(buf, &ep, 10);
+    if (ep == buf)
+        return rc;
+    if (ul >= UINT16_MAX)
+        return rc;
+    return (unsigned int)ul;
+}
+
+static void vdistance_set(unsigned int *vdistance,
+                                unsigned int nr_vnodes,
+                                unsigned int samenode,
+                                unsigned int othernode)
+{
+    unsigned int idx, slot;
+    for (idx = 0; idx < nr_vnodes; idx++)
+        for (slot = 0; slot < nr_vnodes; slot++)
+            *(vdistance + slot * nr_vnodes + idx) =
+                idx == slot ? samenode : othernode;
+}
+
+static void vcputovnode_default(unsigned int *cpu_to_node,
+                                unsigned int nr_vnodes,
+                                unsigned int max_vcpus)
+{
+    unsigned int cpu;
+    for (cpu = 0; cpu < max_vcpus; cpu++)
+        cpu_to_node[cpu] = cpu % nr_vnodes;
+}
+
+/* Split domain memory between vNUMA nodes equally. */
+static int split_vnumamem(libxl_domain_build_info *b_info)
+{
+    unsigned long long vnodemem = 0;
+    unsigned long n;
+    unsigned int i;
+
+    if (b_info->nr_nodes == 0)
+        return -1;
+
+    vnodemem = (b_info->max_memkb >> 10) / b_info->nr_nodes;
+    if (vnodemem < MIN_VNODE_SIZE)
+        return -1;
+    /* remainder in MBytes. */
+    n = (b_info->max_memkb >> 10) % b_info->nr_nodes;
+    /* get final sizes in MBytes. */
+    for (i = 0; i < (b_info->nr_nodes - 1); i++)
+        b_info->vnuma_mem[i] = vnodemem;
+    /* add the remainder to the last node. */
+    b_info->vnuma_mem[i] = vnodemem + n;
+    return 0;
+}
+
+static void vnuma_vnodemap_default(unsigned int *vnuma_vnodemap,
+                                   unsigned int nr_vnodes)
+{
+    unsigned int i;
+    for (i = 0; i < nr_vnodes; i++)
+        vnuma_vnodemap[i] = VNUMA_NO_NODE;
+}
+
+/*
+ * init vNUMA to "zero config" with one node and all other
+ * topology parameters set to default.
+ */
+static int vnuma_zero_config(libxl_domain_build_info *b_info)
+{
+    b_info->nr_nodes = 1;
+    /* all memory goes to this one vnode, as well as vcpus. */
+    if (!(b_info->vnuma_mem = (uint64_t *)calloc(b_info->nr_nodes,
+                                sizeof(*b_info->vnuma_mem))))
+        goto bad_vnumazerocfg;
+
+    if (!(b_info->vnuma_vcpumap = (unsigned int *)calloc(b_info->max_vcpus,
+                                sizeof(*b_info->vnuma_vcpumap))))
+        goto bad_vnumazerocfg;
+
+    if (!(b_info->vdistance = (unsigned int *)calloc(b_info->nr_nodes *
+                                b_info->nr_nodes, sizeof(*b_info->vdistance))))
+        goto bad_vnumazerocfg;
+
+    if (!(b_info->vnuma_vnodemap = (unsigned int *)calloc(b_info->nr_nodes,
+                                sizeof(*b_info->vnuma_vnodemap))))
+        goto bad_vnumazerocfg;
+
+    b_info->vnuma_mem[0] = b_info->max_memkb >> 10;
+
+    /* all vcpus assigned to this vnode. */
+    vcputovnode_default(b_info->vnuma_vcpumap, b_info->nr_nodes,
+                        b_info->max_vcpus);
+
+    /* default vdistance is 10. */
+    vdistance_set(b_info->vdistance, b_info->nr_nodes, 10, 10);
+
+    /* VNUMA_NO_NODE for vnode_to_pnode. */
+    vnuma_vnodemap_default(b_info->vnuma_vnodemap, b_info->nr_nodes);
+
+    /*
+     * will be placed to some physical nodes defined by automatic
+     * numa placement or VNUMA_NO_NODE will not request exact node.
+     */
+    libxl_defbool_set(&b_info->vnuma_autoplacement, true);
+    return 0;
+
+ bad_vnumazerocfg:
+    return -1;
+}
+
+static void free_vnuma_info(libxl_domain_build_info *b_info)
+{
+    free(b_info->vnuma_mem);
+    free(b_info->vdistance);
+    free(b_info->vnuma_vcpumap);
+    free(b_info->vnuma_vnodemap);
+    b_info->nr_nodes = 0;
+}
+
+static int parse_vnuma_mem(XLU_Config *config,
+                            libxl_domain_build_info **b_info)
+{
+    libxl_domain_build_info *dst;
+    XLU_ConfigList *vnumamemcfg;
+    int nr_vnuma_regions, i;
+    unsigned long long vnuma_memparsed = 0;
+    unsigned long ul;
+    const char *buf;
+
+    dst = *b_info;
+    if (!xlu_cfg_get_list(config, "vnuma_mem",
+                          &vnumamemcfg, &nr_vnuma_regions, 0)) {
+
+        if (nr_vnuma_regions != dst->nr_nodes) {
+            fprintf(stderr, "Number of numa regions (vnumamem = %d) is \
+                    incorrect (should be %d).\n", nr_vnuma_regions,
+                    dst->nr_nodes);
+            goto bad_vnuma_mem;
+        }
+
+        dst->vnuma_mem = calloc(dst->nr_nodes,
+                                 sizeof(*dst->vnuma_mem));
+        if (dst->vnuma_mem == NULL) {
+            fprintf(stderr, "Unable to allocate memory for vnuma ranges.\n");
+            goto bad_vnuma_mem;
+        }
+
+        char *ep;
+        /*
+         * Parse only nr_nodes entries, even if more or fewer regions are
+         * given; handle missing entries later and discard any extra ones.
+         */
+        for (i = 0; i < dst->nr_nodes; i++) {
+            buf = xlu_cfg_get_listitem(vnumamemcfg, i);
+            if (!buf) {
+                fprintf(stderr,
+                        "xl: Unable to get element %d in vnuma memory list.\n", i);
+                if (vnuma_zero_config(dst))
+                    goto bad_vnuma_mem;
+
+            }
+            ul = strtoul(buf, &ep, 10);
+            if (ep == buf) {
+                fprintf(stderr, "xl: Invalid argument parsing vnumamem: %s.\n", buf);
+                if (vnuma_zero_config(dst))
+                    goto bad_vnuma_mem;
+            }
+
+            /* 32MB is the minimum size for a node, taken from Linux. */
+            if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) {
+                fprintf(stderr, "xl: vnuma memory %lu is not within %u - %u range.\n",
+                        ul, MIN_VNODE_SIZE, UINT32_MAX);
+                if (vnuma_zero_config(dst))
+                    goto bad_vnuma_mem;
+            }
+
+            /* memory in MBytes */
+            dst->vnuma_mem[i] = ul;
+        }
+
+        /* Total memory for vNUMA parsed to verify */
+        for (i = 0; i < nr_vnuma_regions; i++)
+            vnuma_memparsed = vnuma_memparsed + (dst->vnuma_mem[i]);
+
+        /* Amount of memory for vnodes same as total? */
+        if ((vnuma_memparsed << 10) != (dst->max_memkb)) {
+            fprintf(stderr, "xl: vnuma memory is not the same as domain \
+                    memory size.\n");
+            goto bad_vnuma_mem;
+        }
+    } else {
+        dst->vnuma_mem = calloc(dst->nr_nodes,
+                                      sizeof(*dst->vnuma_mem));
+        if (dst->vnuma_mem == NULL) {
+            fprintf(stderr, "Unable to allocate memory for vnuma ranges.\n");
+            goto bad_vnuma_mem;
+        }
+
+        fprintf(stderr, "WARNING: vNUMA memory ranges were not specified.\n");
+        fprintf(stderr, "Using default equal vnode memory size %lu Kbytes \
+                to cover %lu Kbytes.\n",
+                dst->max_memkb / dst->nr_nodes, dst->max_memkb);
+
+        if (split_vnumamem(dst) < 0) {
+            fprintf(stderr, "Could not split vnuma memory into equal chunks.\n");
+            goto bad_vnuma_mem;
+        }
+    }
+    return 0;
+
+ bad_vnuma_mem:
+    return -1;
+}
+
+static int parse_vnuma_distance(XLU_Config *config,
+                                libxl_domain_build_info **b_info)
+{
+    libxl_domain_build_info *dst;
+    XLU_ConfigList *vdistancecfg;
+    int nr_vdist;
+
+    dst = *b_info;
+    dst->vdistance = calloc(dst->nr_nodes * dst->nr_nodes,
+                               sizeof(*dst->vdistance));
+    if (dst->vdistance == NULL)
+        goto bad_distance;
+
+    if (!xlu_cfg_get_list(config, "vdistance", &vdistancecfg, &nr_vdist, 0)) {
+        int d1, d2;
+        /*
+         * The first value is the same-node distance, the second is used
+         * for all other distances. This is currently required to avoid a
+         * non-symmetrical distance table, as that may break recent kernels.
+         * TODO: better way to analyze an extended distance table, possibly
+         * OS specific.
+         */
+         d1 = get_list_item_uint(vdistancecfg, 0);
+         if (dst->nr_nodes > 1)
+            d2 = get_list_item_uint(vdistancecfg, 1);
+         else
+            d2 = d1;
+
+         if (d1 >= 0 && d2 >= 0) {
+            if (d1 < d2)
+                fprintf(stderr, "WARNING: vnuma distance d1 < d2, %u < %u\n", d1, d2);
+            vdistance_set(dst->vdistance, dst->nr_nodes, d1, d2);
+         } else {
+            fprintf(stderr, "WARNING: vnuma distance values are incorrect.\n");
+            goto bad_distance;
+         }
+
+    } else {
+        fprintf(stderr, "Could not parse vnuma distances.\n");
+        vdistance_set(dst->vdistance, dst->nr_nodes, 10, 20);
+    }
+    return 0;
+
+ bad_distance:
+    return -1;
+}
+
+static int parse_vnuma_vcpumap(XLU_Config *config,
+                                libxl_domain_build_info **b_info)
+{
+    libxl_domain_build_info *dst;
+    XLU_ConfigList *vcpumap;
+    int nr_vcpumap, i;
+
+    dst = *b_info;
+    dst->vnuma_vcpumap = (unsigned int *)calloc(dst->max_vcpus,
+                                     sizeof(*dst->vnuma_vcpumap));
+    if (dst->vnuma_vcpumap == NULL)
+        goto bad_vcpumap;
+
+    if (!xlu_cfg_get_list(config, "vnuma_vcpumap",
+                          &vcpumap, &nr_vcpumap, 0)) {
+        if (nr_vcpumap == dst->max_vcpus) {
+            unsigned int  vnode, vcpumask = 0, vmask;
+            vmask = ~(~0 << nr_vcpumap);
+            for (i = 0; i < nr_vcpumap; i++) {
+                vnode = get_list_item_uint(vcpumap, i);
+                if (vnode >= 0 && vnode < dst->nr_nodes) {
+                    vcpumask  |= (1 << i);
+                    dst->vnuma_vcpumap[i] = vnode;
+                }
+            }
+
+            /* Did the map cover all vcpus in the vcpu mask? */
+            if ( !(((vmask & vcpumask) + 1) == (1 << nr_vcpumap)) ) {
+                fprintf(stderr, "WARNING: Not all vnodes were covered \
+                        in numa_cpumask.\n");
+                goto bad_vcpumap;
+            }
+        } else {
+            fprintf(stderr, "WARNING:  Bad vnuma_vcpumap.\n");
+            goto bad_vcpumap;
+        }
+    }
+    else
+        vcputovnode_default(dst->vnuma_vcpumap,
+                            dst->nr_nodes,
+                            dst->max_vcpus);
+    return 0;
+
+ bad_vcpumap:
+    return -1;
+}
+
+static int parse_vnuma_vnodemap(XLU_Config *config,
+                                libxl_domain_build_info **b_info)
+{
+    libxl_domain_build_info *dst;
+    XLU_ConfigList *vnodemap;
+    int nr_vnodemap, i;
+
+    dst = *b_info;
+
+    /* Is there a mapping to physical NUMA nodes? */
+    dst->vnuma_vnodemap = (unsigned int *)calloc(dst->nr_nodes,
+                                    sizeof(*dst->vnuma_vnodemap));
+    if (dst->vnuma_vnodemap == NULL)
+        goto bad_vnodemap;
+    if (!xlu_cfg_get_list(config, "vnuma_vnodemap",&vnodemap,
+                                            &nr_vnodemap, 0)) {
+        /*
+         * If not specified or incorrect, it will be defined
+         * later based on the machine architecture, configuration
+         * and memory available when creating the domain.
+         */
+        libxl_defbool_set(&dst->vnuma_autoplacement, false);
+        if (nr_vnodemap == dst->nr_nodes) {
+            unsigned int vnodemask = 0, pnode, smask;
+            smask = ~(~0 << dst->nr_nodes);
+            for (i = 0; i < dst->nr_nodes; i++) {
+                pnode = get_list_item_uint(vnodemap, i);
+                if (pnode >= 0) {
+                    vnodemask |= (1 << i);
+                    dst->vnuma_vnodemap[i] = pnode;
+                }
+            }
+
+            /* Did the map cover all vnodes in the mask? */
+            if ( !(((vnodemask & smask) + 1) == (1 << nr_vnodemap)) ) {
+                fprintf(stderr, "WARNING: Not all vnodes were covered \
+                        in vnuma_vnodemap.\n");
+                fprintf(stderr, "Automatic placement will be used for vnodes.\n");
+                libxl_defbool_set(&dst->vnuma_autoplacement, true);
+                vnuma_vnodemap_default(dst->vnuma_vnodemap, dst->nr_nodes);
+            }
+        }
+        else {
+            fprintf(stderr, "WARNING: Incorrect vnuma_vnodemap.\n");
+            fprintf(stderr, "Automatic placement will be used for vnodes.\n");
+            libxl_defbool_set(&dst->vnuma_autoplacement, true);
+            vnuma_vnodemap_default(dst->vnuma_vnodemap, dst->nr_nodes);
+        }
+    }
+    else {
+        fprintf(stderr, "WARNING: Missing vnuma_vnodemap.\n");
+        fprintf(stderr, "Automatic placement will be used for vnodes.\n");
+        libxl_defbool_set(&dst->vnuma_autoplacement, true);
+        vnuma_vnodemap_default(dst->vnuma_vnodemap, dst->nr_nodes);
+    }
+    return 0;
+
+ bad_vnodemap:
+    return -1;
+
+}
+
+static void parse_vnuma_config(XLU_Config *config,
+                               libxl_domain_build_info *b_info)
+{
+    long l;
+
+    if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) {
+        if (l > MAX_VNUMA_NODES) {
+            fprintf(stderr, "Too many vnuma nodes, max %d is allowed.\n",
+                    MAX_VNUMA_NODES);
+            goto bad_vnuma_config;
+        }
+        b_info->nr_nodes = l;
+
+        if (!xlu_cfg_get_defbool(config, "vnuma_autoplacement",
+                    &b_info->vnuma_autoplacement, 0))
+            libxl_defbool_set(&b_info->vnuma_autoplacement, false);
+
+        /* Only construct nodes with at least one vcpu for now */
+        if (b_info->nr_nodes != 0 && b_info->max_vcpus >= b_info->nr_nodes) {
+
+           if (parse_vnuma_mem(config, &b_info) ||
+                parse_vnuma_distance(config, &b_info) ||
+                parse_vnuma_vcpumap(config, &b_info) ||
+                parse_vnuma_vnodemap(config, &b_info))
+                goto bad_vnuma_config;
+        }
+        else if (vnuma_zero_config(b_info))
+            goto bad_vnuma_config;
+    }
+    /* If vnuma topology is not defined for domain, init one node */
+    else if (vnuma_zero_config(b_info))
+            goto bad_vnuma_config;
+    return;
+
+ bad_vnuma_config:
+    free_vnuma_info(b_info);
+    exit(1);
+}
+
 static void parse_config_data(const char *config_source,
                               const char *config_data,
                               int config_len,
@@ -1021,6 +1439,13 @@ static void parse_config_data(const char *config_source,
             exit(1);
         }
 
+
+        /*
+         * If there is no vnuma in config, "zero" vnuma config
+         * will be initialized with one node and other defaults.
+         */
+        parse_vnuma_config(config, b_info);
+
         xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0);
         switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args",
                                       &b_info->u.pv.bootloader_args, 1))
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 06/10] libxc: move code to arch_boot_alloc func
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (4 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-29 10:38   ` Ian Campbell
  2014-07-18  5:50 ` [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled Elena Ufimtseva
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

No functional changes, just moving code into a new
arch_boot_alloc() function. Prepares for the next patch,
"libxc: allocate domain memory for vnuma enabled".

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/xc_dom.h     |    1 +
 tools/libxc/xc_dom_x86.c |   39 +++++++++++++++++++++++++--------------
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index 6ae6a9f..71a3701 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -385,6 +385,7 @@ static inline xen_pfn_t xc_dom_p2m_guest(struct xc_dom_image *dom,
 int arch_setup_meminit(struct xc_dom_image *dom);
 int arch_setup_bootearly(struct xc_dom_image *dom);
 int arch_setup_bootlate(struct xc_dom_image *dom);
+int arch_boot_alloc(struct xc_dom_image *dom);
 
 /*
  * Local variables:
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index bf06fe4..40d3408 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -756,10 +756,30 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
     return rc;
 }
 
+int arch_boot_alloc(struct xc_dom_image *dom)
+{
+        int rc = 0;
+        xen_pfn_t allocsz, i;
+
+        /* allocate guest memory */
+        for ( i = rc = allocsz = 0;
+              (i < dom->total_pages) && !rc;
+              i += allocsz )
+        {
+            allocsz = dom->total_pages - i;
+            if ( allocsz > 1024*1024 )
+                allocsz = 1024*1024;
+            rc = xc_domain_populate_physmap_exact(
+                dom->xch, dom->guest_domid, allocsz,
+                0, 0, &dom->p2m_host[i]);
+        }
+        return rc;
+}
+
 int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
-    xen_pfn_t pfn, allocsz, i, j, mfn;
+    xen_pfn_t pfn, i, j, mfn;
 
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -811,19 +831,10 @@ int arch_setup_meminit(struct xc_dom_image *dom)
         /* setup initial p2m */
         for ( pfn = 0; pfn < dom->total_pages; pfn++ )
             dom->p2m_host[pfn] = pfn;
-        
-        /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
-        {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
-        }
+
+        rc = arch_boot_alloc(dom);
+        if ( rc )
+            return rc;
 
         /* Ensure no unclaimed pages are left unused.
          * OK to call if hadn't done the earlier claim call. */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (5 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 06/10] libxc: move code to arch_boot_alloc func Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-29 10:43   ` Ian Campbell
  2014-07-18  5:50 ` [PATCH v6 08/10] libxl: build numa nodes memory blocks Elena Ufimtseva
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

vNUMA-aware domain memory allocation based on the provided
vnode to pnode map. If this map is not defined, the default
allocation is used; the default allocation does not request
any specific physical node when allocating memory.
Domain creation will fail if memory for any vnode cannot be allocated.
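
For example (illustrative numbers, not taken from the patch): with
vnuma_mem = [1024, 3072] and vnuma_vnodemap = [1, 0], the new
arch_boot_alloc() below populates 262144 pages (1024 MB in 4K pages)
with XENMEMF_exact_node(1) and 786432 pages (3072 MB) with
XENMEMF_exact_node(0); a vnode mapped to VNUMA_NO_NODE is populated
with no node-specific memflags.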

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/xc_dom.h     |   10 ++++++
 tools/libxc/xc_dom_x86.c |   76 ++++++++++++++++++++++++++++++++++------------
 tools/libxc/xg_private.h |    1 +
 3 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index 71a3701..4cc08bc 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -164,6 +164,16 @@ struct xc_dom_image {
 
     /* kernel loader */
     struct xc_dom_arch *arch_hooks;
+
+    /*
+     * vNUMA topology and memory allocation structure.
+     * Defines how memory is allocated on the physical NUMA
+     * nodes given by vnode_to_pnode.
+     */
+    uint32_t nr_nodes;
+    uint64_t *numa_memszs;
+    unsigned int *vnode_to_pnode;
+
     /* allocate up to virt_alloc_end */
     int (*allocate) (struct xc_dom_image * dom, xen_vaddr_t up_to);
 };
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index 40d3408..23267ed 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -756,26 +756,6 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
     return rc;
 }
 
-int arch_boot_alloc(struct xc_dom_image *dom)
-{
-        int rc = 0;
-        xen_pfn_t allocsz, i;
-
-        /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
-        {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
-        }
-        return rc;
-}
-
 int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
@@ -832,6 +812,13 @@ int arch_setup_meminit(struct xc_dom_image *dom)
         for ( pfn = 0; pfn < dom->total_pages; pfn++ )
             dom->p2m_host[pfn] = pfn;
 
+        /* allocate guest memory */
+        if ( dom->nr_nodes == 0 ) {
+            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                         "%s: Cannot allocate domain memory for 0 vnodes\n",
+                         __FUNCTION__);
+            return -EINVAL;
+        }
         rc = arch_boot_alloc(dom);
         if ( rc )
             return rc;
@@ -841,6 +828,55 @@ int arch_setup_meminit(struct xc_dom_image *dom)
         (void)xc_domain_claim_pages(dom->xch, dom->guest_domid,
                                     0 /* cancels the claim */);
     }
+    return rc;
+}
+
+/*
+ * Allocates domain memory taking into account the
+ * defined vnuma topology and vnode_to_pnode map.
+ * Any PV guest will have at least one vnuma node
+ * with numa_memszs[0] = domain memory and the rest
+ * of the topology initialized with default values.
+ */
+int arch_boot_alloc(struct xc_dom_image *dom)
+{
+    int rc;
+    unsigned int n, memflags;
+    unsigned long long vnode_pages;
+    unsigned long long allocsz = 0, node_pfn_base, i;
+
+    rc = allocsz = node_pfn_base = 0;
+
+    allocsz = 0;
+    for ( n = 0; n < dom->nr_nodes; n++ )
+    {
+        memflags = 0;
+        if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE )
+        {
+            memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]);
+            memflags |= XENMEMF_exact_node_request;
+        }
+        /* numa_memszs[] is in megabytes, calculate pages for this node. */
+        vnode_pages = (dom->numa_memszs[n] << 20) >> PAGE_SHIFT_X86;
+        for ( i = 0; i < vnode_pages; i += allocsz )
+        {
+            allocsz = vnode_pages - i;
+            if ( allocsz > 1024*1024 )
+                allocsz = 1024*1024;
+
+            rc = xc_domain_populate_physmap_exact(dom->xch, dom->guest_domid,
+                                            allocsz, 0, memflags,
+                                            &dom->p2m_host[node_pfn_base + i]);
+            if ( rc )
+            {
+                xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                        "%s: Failed allocation of %Lu pages for vnode %d on pnode %d out of %lu\n",
+                        __FUNCTION__, vnode_pages, n, dom->vnode_to_pnode[n], dom->total_pages);
+                return rc;
+            }
+        }
+        node_pfn_base += i;
+    }
 
     return rc;
 }
diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h
index e593364..21e4a20 100644
--- a/tools/libxc/xg_private.h
+++ b/tools/libxc/xg_private.h
@@ -123,6 +123,7 @@ typedef uint64_t l4_pgentry_64_t;
 #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
 #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)
 
+#define VNUMA_NO_NODE ~((unsigned int)0)
 
 /* XXX SMH: following skanky macros rely on variable p2m_size being set */
 /* XXX TJD: also, "guest_width" should be the guest's sizeof(unsigned long) */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 08/10] libxl: build numa nodes memory blocks
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (6 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18 11:01   ` Wei Liu
  2014-07-18  5:50 ` [PATCH v6 09/10] libxl: vnuma nodes placement bits Elena Ufimtseva
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Create the vmemrange structure based on the
PV guest's e820 map. Values are in Megabytes.
Also export the e820 filter code e820_sanitize
so that it is available internally.
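
As a rough illustration (assuming two 1024 MB vnodes and no e820 holes
below 2 GB), the resulting vmemrange entries would be:

    memblks[0]: start 0x00000000, end 0x40000000
    memblks[1]: start 0x40000000, end 0x80000000

With e820_host enabled, a block's end address is extended step by step
until the usable RAM inside it covers the requested vnode size.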

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_internal.h |   12 +++
 tools/libxl/libxl_numa.c     |  193 ++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_x86.c      |    3 +-
 3 files changed, 207 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index e8f2abb..80f81cd 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3086,6 +3086,18 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
     libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
 }
 
+bool libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info);
+
+int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
+                                  uint32_t *nr_entries,
+                                  unsigned long map_limitkb,
+                                  unsigned long balloon_kb);
+
+int libxl__vnuma_align_mem(libxl__gc *gc,
+                                     uint32_t domid,
+                                     struct libxl_domain_build_info *b_info,
+                                     vmemrange_t *memblks);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
index 94ca4fe..755dc19 100644
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -19,6 +19,8 @@
 
 #include "libxl_internal.h"
 
+#include "libxl_vnuma.h"
+
 /*
  * What follows are helpers for generating all the k-combinations
  * without repetitions of a set S with n elements in it. Formally
@@ -508,6 +510,197 @@ int libxl__get_numa_candidate(libxl__gc *gc,
 }
 
 /*
+ * Check if we can fit vnuma nodes to numa pnodes
+ * from vnode_to_pnode array.
+ */
+bool libxl__vnodemap_is_usable(libxl__gc *gc,
+                            libxl_domain_build_info *info)
+{
+    unsigned int i;
+    libxl_numainfo *ninfo = NULL;
+    unsigned long long *claim;
+    unsigned int node;
+    uint64_t *sz_array;
+    int nr_nodes = 0;
+
+    /* Cannot use specified mapping if not NUMA machine. */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return false;
+
+    sz_array = info->vnuma_mem;
+    claim = libxl__calloc(gc, info->nr_nodes, sizeof(*claim));
+    /* Get total memory required on each physical node. */
+    for (i = 0; i < info->nr_nodes; i++)
+    {
+        node = info->vnuma_vnodemap[i];
+
+        if (node < nr_nodes)
+            claim[node] += (sz_array[i] << 20);
+        else
+            goto vnodemapout;
+   }
+   for (i = 0; i < nr_nodes; i++) {
+       if (claim[i] > ninfo[i].free)
+          /* Cannot satisfy the user request, fall back to default. */
+          goto vnodemapout;
+   }
+
+ vnodemapout:
+   return true;
+}
+
+/*
+ * Returns number of absent pages within e820 map
+ * between start and end addresses passed. Needed
+ * to correctly set numa memory ranges for domain.
+ */
+static unsigned long e820_memory_hole_size(unsigned long start,
+                                            unsigned long end,
+                                            struct e820entry e820[],
+                                            unsigned int nr)
+{
+    unsigned int i;
+    unsigned long absent, start_blk, end_blk;
+
+    /* start with the whole range counted as absent. */
+    absent = end - start;
+    for (i = 0; i < nr; i++) {
+        /* if not E820_RAM region, skip it. */
+        if (e820[i].type == E820_RAM) {
+            start_blk = e820[i].addr;
+            end_blk = e820[i].addr + e820[i].size;
+            /* beginning address is within this region? */
+            if (start >= start_blk && start <= end_blk) {
+                if (end > end_blk)
+                    absent -= end_blk - start;
+                else
+                    /* fit the region? then no absent pages. */
+                    absent -= end - start;
+                continue;
+            }
+            /* found the end of range in this region? */
+            if (end <= end_blk && end >= start_blk) {
+                absent -= end - start_blk;
+                /* no need to look for more ranges. */
+                break;
+            }
+        }
+    }
+    return absent;
+}
+
+/*
+ * For each node, build the memory block start and end addresses.
+ * Subtract any memory holes from the ranges found in the e820 map.
+ * vnode memory sizes are passed here in megabytes; the result is
+ * expressed as memory block addresses.
+ * The Linux kernel will adjust numa memory block sizes on its own,
+ * but we want to provide the kernel with numa block addresses that
+ * are the same in the kernel and the hypervisor.
+ */
+#define max(a,b) (((a) > (b)) ? (a) : (b))
+int libxl__vnuma_align_mem(libxl__gc *gc,
+                            uint32_t domid,
+                            /* IN: mem sizes in megabytes */
+                            libxl_domain_build_info *b_info,
+                            /* OUT: linux NUMA blocks addresses */
+                            vmemrange_t *memblks)
+{
+    unsigned int i;
+    int j, rc;
+    uint64_t next_start_blk, end_max = 0, size;
+    uint32_t nr;
+    struct e820entry map[E820MAX];
+
+    errno = ERROR_INVAL;
+    if (b_info->nr_nodes == 0)
+        return -EINVAL;
+
+    if (!memblks || !b_info->vnuma_mem)
+        return -EINVAL;
+
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+
+    /* Retrieve e820 map for this host. */
+    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
+
+    if (rc < 0) {
+        errno = rc;
+        return -EINVAL;
+    }
+    nr = rc;
+    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    if (rc)
+    {
+        errno = rc;
+        return -EINVAL;
+    }
+
+    /* find max memory address for this host. */
+    for (j = 0; j < nr; j++)
+        if (map[j].type == E820_RAM) {
+            end_max = max(end_max, map[j].addr + map[j].size);
+        }
+
+    memset(memblks, 0, sizeof(*memblks) * b_info->nr_nodes);
+    next_start_blk = 0;
+
+    memblks[0].start = map[0].addr;
+
+    for (i = 0; i < b_info->nr_nodes; i++) {
+
+        memblks[i].start += next_start_blk;
+        memblks[i].end = memblks[i].start + (b_info->vnuma_mem[i] << 20);
+
+        if (memblks[i].end > end_max) {
+            LIBXL__LOG(ctx, LIBXL__LOG_DEBUG,
+                    "Shrunk vNUMA memory block %d address to max e820 address: \
+                    %#010lx -> %#010lx\n", i, memblks[i].end, end_max);
+            memblks[i].end = end_max;
+            break;
+        }
+
+        size = memblks[i].end - memblks[i].start;
+        /*
+         * For pv host with e820_host option turned on we need
+         * to take into account memory holes. For pv host with
+         * e820_host disabled or unset, the map is a contiguous
+         * RAM region.
+         */
+        if (libxl_defbool_val(b_info->u.pv.e820_host)) {
+            while((memblks[i].end - memblks[i].start -
+                   e820_memory_hole_size(memblks[i].start,
+                   memblks[i].end, map, nr)) < size )
+            {
+                memblks[i].end += MIN_VNODE_SIZE << 10;
+                if (memblks[i].end > end_max) {
+                    memblks[i].end = end_max;
+                    LIBXL__LOG(ctx, LIBXL__LOG_DEBUG,
+                            "Shrunk vNUMA memory block %d address to max e820 \
+                            address: %#010lx -> %#010lx\n", i, memblks[i].end,
+                            end_max);
+                    break;
+                }
+            }
+        }
+        next_start_blk = memblks[i].end;
+        LIBXL__LOG(ctx, LIBXL__LOG_DEBUG,"i %d, start  = %#010lx, \
+                    end = %#010lx\n", i, memblks[i].start, memblks[i].end);
+    }
+
+    /* Did not form memory addresses for every node? */
+    if (i != b_info->nr_nodes)  {
+        LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "Not all nodes were populated with \
+                block addresses, only %d out of %d", i, b_info->nr_nodes);
+        return -EINVAL;
+    }
+    return 0;
+}
+
+/*
  * Local variables:
  * mode: C
  * c-basic-offset: 4
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 7589060..46e84e4 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -1,5 +1,6 @@
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_vnuma.h"
 
 static const char *e820_names(int type)
 {
@@ -14,7 +15,7 @@ static const char *e820_names(int type)
     return "Unknown";
 }
 
-static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
+int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
                          uint32_t *nr_entries,
                          unsigned long map_limitkb,
                          unsigned long balloon_kb)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 09/10] libxl: vnuma nodes placement bits
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (7 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 08/10] libxl: build numa nodes memory blocks Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18  5:50 ` [PATCH v6 10/10] libxl: set vnuma for domain Elena Ufimtseva
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Automatic NUMA placement overrides the manual vnode placement
mechanism. If manual placement is explicitly specified, try
to fit the vnodes onto the requested physical nodes.
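
For example (paraphrasing the libxl__build_pre() hunk below): with
numa_placement left at its default, the configured vnuma_vnodemap is
ignored and vnodes are mapped to pnodes only after numa_place_domain()
has run; with numa_placement=0 and vnuma_autoplacement=0, a map such as
vnuma_vnodemap=[2, 3] is honoured, provided libxl__vnodemap_is_usable()
finds enough free memory on pnodes 2 and 3; otherwise
libxl__init_vnode_to_pnode() picks pnodes by free memory.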

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_create.c |    1 +
 tools/libxl/libxl_dom.c    |  148 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 0686f96..038f96e 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -188,6 +188,7 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
         return ERROR_FAIL;
 
     libxl_defbool_setdefault(&b_info->numa_placement, true);
+    libxl_defbool_setdefault(&b_info->vnuma_autoplacement, true);
 
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 83eb29a..98e42b3 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -23,6 +23,7 @@
 #include <xc_dom.h>
 #include <xen/hvm/hvm_info_table.h>
 #include <xen/hvm/hvm_xs_strings.h>
+#include <libxl_vnuma.h>
 
 libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid)
 {
@@ -227,6 +228,61 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
                     libxl_defbool_val(info->u.hvm.nested_hvm));
 }
 
+/* sets vnode_to_pnode map. */
+static int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
+                        libxl_domain_build_info *info)
+{
+    unsigned int i, n;
+    int nr_nodes = 0;
+    uint64_t *vnodes_mem;
+    unsigned long long *nodes_claim = NULL;
+    libxl_numainfo *ninfo = NULL;
+
+    if (info->vnuma_vnodemap == NULL) {
+        info->vnuma_vnodemap = libxl__calloc(gc, info->nr_nodes,
+                                      sizeof(*info->vnuma_vnodemap));
+    }
+
+    /* default setting. */
+    for (i = 0; i < info->nr_nodes; i++)
+        info->vnuma_vnodemap[i] = VNUMA_NO_NODE;
+
+    /* Get NUMA info. */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return ERROR_FAIL;
+    /* Nothing to see if only one NUMA node. */
+    if (nr_nodes <= 1)
+        return 0;
+
+    vnodes_mem = info->vnuma_mem;
+    /*
+     * TODO: change the algorithm. The current one just fits the nodes
+     * by their memory sizes. If no pnode is found, the default value
+     * of VNUMA_NO_NODE is used.
+     */
+    nodes_claim = libxl__calloc(gc, info->nr_nodes, sizeof(*nodes_claim));
+    if ( !nodes_claim )
+        return ERROR_FAIL;
+
+    libxl_for_each_set_bit(n, info->nodemap)
+    {
+        for (i = 0; i < info->nr_nodes; i++)
+        {
+            unsigned long mem_sz = vnodes_mem[i] << 20;
+            if ((nodes_claim[n] + mem_sz <= ninfo[n].free) &&
+                 /* vnode was not set yet. */
+                 (info->vnuma_vnodemap[i] == VNUMA_NO_NODE ) )
+            {
+                info->vnuma_vnodemap[i] = n;
+                nodes_claim[n] += mem_sz;
+            }
+        }
+    }
+
+    return 0;
+}
+
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
               libxl_domain_config *d_config, libxl__domain_build_state *state)
 {
@@ -240,6 +296,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         return ERROR_FAIL;
     }
 
+    struct vmemrange *memrange = libxl__calloc(gc, info->nr_nodes,
+                                            sizeof(*memrange));
+    if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) {
+        LOG(DETAIL, "Failed to align memory map.\n");
+        return ERROR_FAIL;
+    }
+
+    /*
+     * NUMA placement and vNUMA autoplacement handling:
+     * If numa_placement is left at its default, do not use the vnode to
+     * pnode mapping, as the automatic placement algorithm will find the
+     * best NUMA nodes. If numa_placement is not used, we can try to use
+     * the domain's vnode to pnode map.
+     */
+
     /*
      * Check if the domain has any CPU or node affinity already. If not, try
      * to build up the latter via automatic NUMA placement. In fact, in case
@@ -268,7 +339,33 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         rc = numa_place_domain(gc, domid, info);
         if (rc)
             return rc;
+
+        /*
+         * If a vnode_to_pnode map was defined, do not use it when the
+         * domain is placed on NUMA nodes automatically; just warn.
+         */
+        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
+            LOG(INFO, "Automatic NUMA placement for domain is turned on. \
+                vnode to physical nodes mapping will not be used.");
+        }
+        if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
+            LOG(ERROR, "Failed to build vnode to pnode map\n");
+            return ERROR_FAIL;
+        }
+    } else {
+        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
+                if (!libxl__vnodemap_is_usable(gc, info)) {
+                    LOG(ERROR, "Defined vnode to pnode domain map cannot be used.\n");
+                    return ERROR_FAIL;
+                }
+        } else {
+            if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
+                LOG(ERROR, "Failed to build vnode to pnode map.\n");
+                return ERROR_FAIL;
+            }
+        }
     }
+
     if (info->nodemap.size)
         libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
     /* As mentioned in libxl.h, vcpu_hard_array takes precedence */
@@ -294,6 +391,19 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         return ERROR_FAIL;
     }
 
+    /*
+     * The XEN_DOMCTL_setvnuma subop hypercall needs to know the domain's
+     * max mem as set by xc_domain_setmaxmem, so set vNUMA only after
+     * maxmem has been set.
+     */
+    if (xc_domain_setvnuma(ctx->xch, domid, info->nr_nodes,
+        info->max_vcpus, memrange,
+        info->vdistance, info->vnuma_vcpumap,
+        info->vnuma_vnodemap) < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Couldn't set vnuma topology");
+        return ERROR_FAIL;
+    }
+
     xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL);
     state->store_domid = xs_domid ? atoi(xs_domid) : 0;
     free(xs_domid);
@@ -389,6 +499,39 @@ retry_transaction:
     return 0;
 }
 
+static int libxl__dom_vnuma_init(struct libxl_domain_build_info *info,
+                                 struct xc_dom_image *dom)
+{
+    errno = ERROR_INVAL;
+
+    if (info->nr_nodes == 0)
+        return -1;
+
+    dom->vnode_to_pnode = (unsigned int *)malloc(
+                            info->nr_nodes * sizeof(*info->vnuma_vnodemap));
+    dom->numa_memszs = (uint64_t *)malloc(
+                          info->nr_nodes * sizeof(*info->vnuma_mem));
+
+    errno = ERROR_FAIL;
+    if ( dom->numa_memszs == NULL || dom->vnode_to_pnode == NULL ) {
+        info->nr_nodes = 0;
+        if (dom->vnode_to_pnode)
+            free(dom->vnode_to_pnode);
+        if (dom->numa_memszs)
+            free(dom->numa_memszs);
+        return -1;
+    }
+
+    memcpy(dom->numa_memszs, info->vnuma_mem,
+            sizeof(*info->vnuma_mem) * info->nr_nodes);
+    memcpy(dom->vnode_to_pnode, info->vnuma_vnodemap,
+            sizeof(*info->vnuma_vnodemap) * info->nr_nodes);
+
+    dom->nr_nodes = info->nr_nodes;
+
+    return 0;
+}
+
 int libxl__build_pv(libxl__gc *gc, uint32_t domid,
              libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -446,6 +589,11 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     dom->xenstore_domid = state->store_domid;
     dom->claim_enabled = libxl_defbool_val(info->claim_mode);
 
+    if ( (ret = libxl__dom_vnuma_init(info, dom)) != 0 ) {
+        LOGE(ERROR, "Failed to set domain vnuma");
+        goto out;
+    }
+
     if ( (ret = xc_dom_boot_xen_init(dom, ctx->xch, domid)) != 0 ) {
         LOGE(ERROR, "xc_dom_boot_xen_init failed");
         goto out;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 10/10] libxl: set vnuma for domain
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (8 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 09/10] libxl: vnuma nodes placement bits Elena Ufimtseva
@ 2014-07-18  5:50 ` Elena Ufimtseva
  2014-07-18 10:58   ` Wei Liu
  2014-07-29 10:45   ` Ian Campbell
  2014-07-18  6:16 ` [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  5:50 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Call xc_domain_setvnuma to set the vnuma topology for the domain.
Prepares xc_dom_image for the domain bootmem memory allocation.
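
A hypothetical caller sketch for the new public wrapper (illustrative
only, not part of the patch; the vmemrange contents would be filled in
by the caller beforehand):

    unsigned int vdistance[4]      = { 10, 20, 20, 10 };
    unsigned int vcpu_to_vnode[4]  = { 0, 0, 1, 1 };
    unsigned int vnode_to_pnode[2] = { 0, 1 };
    vmemrange_t vmemrange[2];      /* per-vnode ranges, filled beforehand */

    if (libxl_domain_setvnuma(ctx, domid, 2 /* vnodes */, 4 /* vcpus */,
                              vmemrange, vdistance, vcpu_to_vnode,
                              vnode_to_pnode) != 0)
        /* handle ERROR_FAIL */;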

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl.c |   22 ++++++++++++++++++++++
 tools/libxl/libxl.h |   19 +++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 39f1c28..e9f2607 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4807,6 +4807,28 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid,
     return 0;
 }
 
+int libxl_domain_setvnuma(libxl_ctx *ctx,
+                            uint32_t domid,
+                            uint16_t nr_vnodes,
+                            uint16_t nr_vcpus,
+                            vmemrange_t *vmemrange,
+                            unsigned int *vdistance,
+                            unsigned int *vcpu_to_vnode,
+                            unsigned int *vnode_to_pnode)
+{
+    int ret;
+    ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes,
+                                nr_vcpus, vmemrange,
+                                vdistance,
+                                vcpu_to_vnode,
+                                vnode_to_pnode);
+    if (ret < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Could not set vnuma topology for domain %d", domid);
+        return ERROR_FAIL;
+    }
+    return ret;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 459557d..1636c7f 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -330,6 +330,7 @@
 #include <sys/wait.h> /* for pid_t */
 
 #include <xentoollog.h>
+#include <xen/memory.h>
 
 typedef struct libxl__ctx libxl_ctx;
 
@@ -937,6 +938,15 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus);
 void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms);
 void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms);
 
+int libxl_domain_setvnuma(libxl_ctx *ctx,
+                           uint32_t domid,
+                           uint16_t nr_vnodes,
+                           uint16_t nr_vcpus,
+                           vmemrange_t *vmemrange,
+                           unsigned int *vdistance,
+                           unsigned int *vcpu_to_vnode,
+                           unsigned int *vnode_to_pnode);
+
 /*
  * Devices
  * =======
@@ -1283,6 +1293,15 @@ bool libxl_ms_vm_genid_is_zero(const libxl_ms_vm_genid *id);
 int libxl_fd_set_cloexec(libxl_ctx *ctx, int fd, int cloexec);
 int libxl_fd_set_nonblock(libxl_ctx *ctx, int fd, int nonblock);
 
+int libxl_domain_setvnuma(libxl_ctx *ctx,
+                           uint32_t domid,
+                           uint16_t nr_vnodes,
+                           uint16_t nr_vcpus,
+                           vmemrange_t *vmemrange,
+                           unsigned int *vdistance,
+                           unsigned int *vcpu_to_vnode,
+                           unsigned int *vnode_to_pnode);
+
 #include <libxl_event.h>
 
 #endif /* LIBXL_H */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (9 preceding siblings ...)
  2014-07-18  5:50 ` [PATCH v6 10/10] libxl: set vnuma for domain Elena Ufimtseva
@ 2014-07-18  6:16 ` Elena Ufimtseva
  2014-07-18  9:53 ` Wei Liu
  2014-07-22 12:49 ` Dario Faggioli
  12 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-18  6:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, Jan Beulich,
	Elena Ufimtseva


[-- Attachment #1.1: Type: text/plain, Size: 30150 bytes --]

Hello

Thanks to the comments and suggestions on past versions, I am posting version
6 of the vNUMA patches.
Wei, after sending I realized I did not add you to cc, my apologies.

Please send your comments and suggestions.
Thank you

Elena


On Fri, Jul 18, 2014 at 1:49 AM, Elena Ufimtseva <ufimtseva@gmail.com>
wrote:

> vNUMA introduction
>
> This series of patches introduces vNUMA topology awareness and
> provides interfaces and data structures to enable vNUMA for
> PV guests. There is a plan to extend this support for dom0 and
> HVM domains.
>
> vNUMA topology support should be supported by PV guest kernel.
> Corresponding patches should be applied.
>
> Introduction
> -------------
>
> vNUMA topology is exposed to the PV guest to improve performance when
> running
> workloads on NUMA machines. vNUMA enabled guests may be running on non-NUMA
> machines and thus having virtual NUMA topology visible to guests.
> XEN vNUMA implementation provides a way to run vNUMA-enabled guests on
> NUMA/UMA
> and flexibly map vNUMA topology to physical NUMA topology.
>
> Mapping to physical NUMA topology may be done in manual and automatic way.
> By default, every PV domain has one vNUMA node. It is populated by default
> parameters and does not affect performance. To use automatic way of
> initializing
> vNUMA topology, configuration file need only to have number of vNUMA nodes
> defined. Not-defined vNUMA topology parameters will be initialized to
> default
> ones.
>
> vNUMA topology is currently defined as a set of parameters such as:
>     number of vNUMA nodes;
>     distance table;
>     vnodes memory sizes;
>     vcpus to vnodes mapping;
>     vnode to pnode map (for NUMA machines).
>
> This set of patches introduces two hypercall subops:
> XEN_DOMCTL_setvnumainfo
> and XENMEM_get_vnuma_info.
>
>     XEN_DOMCTL_setvnumainfo is used by toolstack to populate domain
> vNUMA topology with user defined configuration or the parameters by
> default.
> vNUMA is defined for every PV domain and if no vNUMA configuration found,
> one vNUMA node is initialized and all cpus are assigned to it. All other
> parameters set to their default values.
>
>     XENMEM_gevnumainfo is used by the PV domain to get the information
> from hypervisor about vNUMA topology. Guest sends its memory sizes
> allocated
> for different vNUMA parameters and hypervisor fills it with topology.
> Future work to use this in HVM guests in the toolstack is required and
> in the hypervisor to allow HVM guests to use these hypercalls.
>
> libxl
>
> libxl allows us to define vNUMA topology in configuration file and
> verifies that
> configuration is correct. libxl also verifies mapping of vnodes to pnodes
> and
> uses it in case of NUMA-machine and if automatic placement was disabled.
> In case
> of incorrect/insufficient configuration, one vNUMA node will be initialized
> and populated with default values.
>
> libxc
>
> libxc builds the vnodes memory addresses for guest and makes necessary
> alignments to the addresses. It also takes into account guest e820 memory
> map
> configuration. The domain memory is allocated and vnode to pnode mapping
> is used to determine target node for particular vnode. If this mapping was
> not
> defined, it is not a NUMA machine or automatic NUMA placement is enabled,
> the
> default not node-specific allocation will be used.
>
> hypervisor vNUMA initialization
>
> PV guest
>
> As of now, only PV guest can take advantage of vNUMA functionality.
> Such guest allocates the memory for NUMA topology, sets number of nodes and
> cpus so that the hypervisor knows how much memory the guest has
> preallocated for the vNUMA topology. The guest then makes the
> XENMEM_getvnumainfo subop hypercall.
> If for some reason the vNUMA topology cannot be initialized, the Linux guest
> will come up with only one NUMA node (standard Linux behavior).
> To enable vNUMA in the guest, the corresponding vNUMA support patches must
> be applied to the PV kernel.
>
> Linux kernel patch is available here:
> https://git.gitorious.org/vnuma/linux_vnuma.git
> git://gitorious.org/vnuma/linux_vnuma.git
>
> Automatic vNUMA placement
>
> vNUMA automatic placement will be used if automatic NUMA placement is
> enabled or, if it is disabled, if the vnode to pnode mapping is incorrect. If
> the vnode to pnode mapping is correct and automatic NUMA placement is disabled,
> vNUMA nodes will be allocated on the physical nodes specified in the guest
> config file.
>
> Xen patchset is available here:
> https://git.gitorious.org/vnuma/xen_vnuma.git
> git://gitorious.org/vnuma/xen_vnuma.git
>
>
> Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
>
> memory = 4000
> vcpus = 2
> # The name of the domain, change this if you want more than 1 VM.
> name = "null"
> vnodes = 2
> #vnumamem = [3000, 1000]
> #vnumamem = [4000,0]
> vdistance = [10, 20]
> vnuma_vcpumap = [1, 0]
> vnuma_vnodemap = [1]
> vnuma_autoplacement = 0
> #e820_host = 1
>
> [    0.000000] Linux version 3.15.0-rc8+ (assert@superpipe) (gcc version
> 4.7.2 (Debian 4.7.2-5) ) #43 SMP Fri Jun 27 01:23:11 EDT 2014
> [    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug
> loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all
> LOGLEVEL=8 earlyprintk=xen sched_debug
> [    0.000000] ACPI in unprivileged domain disabled
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable
> [    0.000000] bootconsole [xenboot0] enabled
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] No AGP bridge found
> [    0.000000] e820: last_pfn = 0xfa000 max_arch_pfn = 0x400000000
> [    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size
> 24576
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0xf9e00000-0xf9ffffff]
> [    0.000000]  [mem 0xf9e00000-0xf9ffffff] page 4k
> [    0.000000] BRK [0x019c8000, 0x019c8fff] PGTABLE
> [    0.000000] BRK [0x019c9000, 0x019c9fff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0xf8000000-0xf9dfffff]
> [    0.000000]  [mem 0xf8000000-0xf9dfffff] page 4k
> [    0.000000] BRK [0x019ca000, 0x019cafff] PGTABLE
> [    0.000000] BRK [0x019cb000, 0x019cbfff] PGTABLE
> [    0.000000] BRK [0x019cc000, 0x019ccfff] PGTABLE
> [    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x80000000-0xf7ffffff]
> [    0.000000]  [mem 0x80000000-0xf7ffffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x00100000-0x7fffffff]
> [    0.000000]  [mem 0x00100000-0x7fffffff] page 4k
> [    0.000000] RAMDISK: [mem 0x01dd8000-0x035c5fff]
> [    0.000000] Nodes received = 2
> [    0.000000] NUMA: Initialized distance table, cnt=2
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0x7cffffff]
> [    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
> [    0.000000] Initmem setup node 1 [mem 0x7d000000-0xf9ffffff]
> [    0.000000]   NODE_DATA [mem 0xf9828000-0xf984efff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
> [    0.000000]   node   0: [mem 0x00100000-0x7cffffff]
> [    0.000000]   node   1: [mem 0x7d000000-0xf9ffffff]
> [    0.000000] On node 0 totalpages: 511903
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 7936 pages used for memmap
> [    0.000000]   DMA32 zone: 507904 pages, LIFO batch:31
> [    0.000000] On node 1 totalpages: 512000
> [    0.000000]   DMA32 zone: 8000 pages used for memmap
> [    0.000000]   DMA32 zone: 512000 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81
> http://simplefirmware.org
> [    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [    0.000000] e820: [mem 0xfa000000-0xffffffff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.5-unstable (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:2
> nr_node_ids:2
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007ac00000 s85888
> r8192 d20608 u2097152
> [    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 [1] 1
> [    0.000000] xen: PV spinlocks enabled
> [    0.000000] Built 2 zonelists in Node order, mobility grouping on.
>  Total pages: 1007882
> [    0.000000] Policy zone: DMA32
> [    0.000000] Kernel command line: root=/dev/xvda1 ro earlyprintk=xen
> debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all
> LOGLEVEL=8 earlyprintk=xen sched_debug
> [    0.000000] Memory: 3978224K/4095612K available (4022K kernel code,
> 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 117388K reserved)
> [    0.000000] Enabling automatic NUMA balancing. Configure with
> numa_balancing= or the kernel.numa_balancing sysctl
> [    0.000000] installing Xen timer for CPU 0
> [    0.000000] tsc: Detected 2394.276 MHz processor
> [    0.004000] Calibrating delay loop (skipped), value calculated using
> timer frequency.. 4788.55 BogoMIPS (lpj=9577104)
> [    0.004000] pid_max: default: 32768 minimum: 301
> [    0.004179] Dentry cache hash table entries: 524288 (order: 10, 4194304
> bytes)
> [    0.006782] Inode-cache hash table entries: 262144 (order: 9, 2097152
> bytes)
> [    0.007216] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [    0.007288] Mountpoint-cache hash table entries: 8192 (order: 4, 65536
> bytes)
> [    0.007935] CPU: Physical Processor ID: 0
> [    0.007942] CPU: Processor Core ID: 0
> [    0.007951] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
> [    0.007951] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0
> [    0.007951] tlb_flushall_shift: 6
> [    0.021249] cpu 0 spinlock event irq 17
> [    0.021292] Performance Events: unsupported p6 CPU model 45 no PMU
> driver, software events only.
> [    0.022162] NMI watchdog: disabled (cpu0): hardware events not enabled
> [    0.022625] installing Xen timer for CPU 1
>
> root@heatpipe:~# numactl --ha
> available: 2 nodes (0-1)
> node 0 cpus: 0
> node 0 size: 1933 MB
> node 0 free: 1894 MB
> node 1 cpus: 1
> node 1 size: 1951 MB
> node 1 free: 1926 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> root@heatpipe:~# numastat
>                            node0           node1
> numa_hit                   52257           92679
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit              4254            4238
> local_node                 52150           87364
> other_node                   107            5315
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 7 (total: 1024000):
> (XEN)     Node 0: 1024000
> (XEN)     Node 1: 0
> (XEN)     Domain has 2 vnodes, 2 vcpus
> (XEN)         vnode 0 - pnode 0, 2000 MB, vcpu nums: 0
> (XEN)         vnode 1 - pnode 0, 2000 MB, vcpu nums: 1
>
>
> memory = 4000
> vcpus = 8
> # The name of the domain, change this if you want more than 1 VM.
> name = "null1"
> vnodes = 8
> #vnumamem = [3000, 1000]
> vdistance = [10, 40]
> #vnuma_vcpumap = [1, 0, 3, 2]
> vnuma_vnodemap = [1, 0, 1, 1, 0, 0, 1, 1]
> vnuma_autoplacement = 1
> e820_host = 1
>
> [    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
> [    0.000000] 1-1 mapping on ac228->100000
> [    0.000000] Released 318936 pages of unused memory
> [    0.000000] Set 343512 page(s) to 1-1 mapping
> [    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
> [    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
> [    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
> [    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
> [    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b5fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac6b6000-0x00000000ac7fafff] ACPI NVS
> [    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
> [    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
> [    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
> [    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
> [    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
> [    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
> [    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
> [    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
> [    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
> [    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
> [    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] No AGP bridge found
> [    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
> [    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
> [    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size
> 24576
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
> [    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
> [    0.000000] BRK [0x019cd000, 0x019cdfff] PGTABLE
> [    0.000000] BRK [0x019ce000, 0x019cefff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
> [    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
> [    0.000000] BRK [0x019cf000, 0x019cffff] PGTABLE
> [    0.000000] BRK [0x019d0000, 0x019d0fff] PGTABLE
> [    0.000000] BRK [0x019d1000, 0x019d1fff] PGTABLE
> [    0.000000] BRK [0x019d2000, 0x019d2fff] PGTABLE
> [    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
> [    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
> [    0.000000]  [mem 0x00100000-0xac227fff] page 4k
> [    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
> [    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
> [    0.000000] RAMDISK: [mem 0x01dd8000-0x0347ffff]
> [    0.000000] Nodes received = 8
> [    0.000000] NUMA: Initialized distance table, cnt=8
> [    0.000000] Initmem setup node 0 [mem 0x00000000-0x1f3fffff]
> [    0.000000]   NODE_DATA [mem 0x1f3d9000-0x1f3fffff]
> [    0.000000] Initmem setup node 1 [mem 0x1f800000-0x3e7fffff]
> [    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
> [    0.000000] Initmem setup node 2 [mem 0x3e800000-0x5dbfffff]
> [    0.000000]   NODE_DATA [mem 0x5dbd9000-0x5dbfffff]
> [    0.000000] Initmem setup node 3 [mem 0x5e000000-0x7cffffff]
> [    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
> [    0.000000] Initmem setup node 4 [mem 0x7d000000-0x9c3fffff]
> [    0.000000]   NODE_DATA [mem 0x9c3d9000-0x9c3fffff]
> [    0.000000] Initmem setup node 5 [mem 0x9c800000-0x10f5d7fff]
> [    0.000000]   NODE_DATA [mem 0x10f5b1000-0x10f5d7fff]
> [    0.000000] Initmem setup node 6 [mem 0x10f800000-0x12e9d7fff]
> [    0.000000]   NODE_DATA [mem 0x12e9b1000-0x12e9d7fff]
> [    0.000000] Initmem setup node 7 [mem 0x12f000000-0x14ddd7fff]
> [    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> [    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
> [    0.000000]   node   0: [mem 0x00100000-0x1f3fffff]
> [    0.000000]   node   1: [mem 0x1f400000-0x3e7fffff]
> [    0.000000]   node   2: [mem 0x3e800000-0x5dbfffff]
> [    0.000000]   node   3: [mem 0x5dc00000-0x7cffffff]
> [    0.000000]   node   4: [mem 0x7d000000-0x9c3fffff]
> [    0.000000]   node   5: [mem 0x9c400000-0xac227fff]
> [    0.000000]   node   5: [mem 0x100000000-0x10f5d7fff]
> [    0.000000]   node   6: [mem 0x10f5d8000-0x12e9d7fff]
> [    0.000000]   node   7: [mem 0x12e9d8000-0x14ddd7fff]
> [    0.000000] On node 0 totalpages: 127903
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3999 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 1936 pages used for memmap
> [    0.000000]   DMA32 zone: 123904 pages, LIFO batch:31
> [    0.000000] On node 1 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 2 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 3 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 4 totalpages: 128000
> [    0.000000]   DMA32 zone: 2000 pages used for memmap
> [    0.000000]   DMA32 zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 5 totalpages: 128000
> [    0.000000]   DMA32 zone: 1017 pages used for memmap
> [    0.000000]   DMA32 zone: 65064 pages, LIFO batch:15
> [    0.000000]   Normal zone: 984 pages used for memmap
> [    0.000000]   Normal zone: 62936 pages, LIFO batch:15
> [    0.000000] On node 6 totalpages: 128000
> [    0.000000]   Normal zone: 2000 pages used for memmap
> [    0.000000]   Normal zone: 128000 pages, LIFO batch:31
> [    0.000000] On node 7 totalpages: 128000
> [    0.000000]   Normal zone: 2000 pages used for memmap
> [    0.000000]   Normal zone: 128000 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81
> http://simplefirmware.org
> [    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac228000-0xac26bfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac26c000-0xac57ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac580000-0xac5a0fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5a1000-0xac5bbfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5bc000-0xac5bdfff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5be000-0xac5befff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5bf000-0xac5cafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5cb000-0xac5d9fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5da000-0xac5fafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac5fb000-0xac6b5fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac6b6000-0xac7fafff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac7fb000-0xac80efff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac80f000-0xac80ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac810000-0xac810fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac811000-0xac812fff]
> [    0.000000] PM: Registered nosave memory: [mem 0xac813000-0xad7fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xad800000-0xafffffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xb0000000-0xb3ffffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xb4000000-0xfed1ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed20000-0xfed3ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed40000-0xfed4ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed50000-0xfed8ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfed90000-0xfedfffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfee00000-0xfeefffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xfef00000-0xff9fffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xffa00000-0xffa3ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0xffa40000-0xffffffff]
> [    0.000000] e820: [mem 0xb4000000-0xfed1ffff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.5-unstable (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:20 nr_cpumask_bits:20 nr_cpu_ids:8
> nr_node_ids:8
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88001e800000 s85888
> r8192 d20608 u2097152
> [    0.000000] pcpu-alloc: s85888 r8192 d20608 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 [4] 4 [5] 5 [6] 6 [7] 7
> [    0.000000] xen: PV spinlocks enabled
> [    0.000000] Built 8 zonelists in Node order, mobility grouping on.
>  Total pages: 1007881
> [    0.000000] Policy zone: Normal
> [    0.000000] Kernel command line: root=/dev/xvda1 ro console=hvc0 debug
>  kgdboc=hvc0 nokgdbroundup  initcall_debug debug
> [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340
> [    0.000000] Checking aperture...
> [    0.000000] No AGP bridge found
> [    0.000000] Memory: 3976748K/4095612K available (4022K kernel code,
> 769K rwdata, 1744K rodata, 1532K init, 1472K bss, 118864K reserved)
>
> root@heatpipe:~# numactl --ha
> maxn: 7
> available: 8 nodes (0-7)
> node 0 cpus: 0
> node 0 size: 458 MB
> node 0 free: 424 MB
> node 1 cpus: 1
> node 1 size: 491 MB
> node 1 free: 481 MB
> node 2 cpus: 2
> node 2 size: 491 MB
> node 2 free: 482 MB
> node 3 cpus: 3
> node 3 size: 491 MB
> node 3 free: 485 MB
> node 4 cpus: 4
> node 4 size: 491 MB
> node 4 free: 485 MB
> node 5 cpus: 5
> node 5 size: 491 MB
> node 5 free: 484 MB
> node 6 cpus: 6
> node 6 size: 491 MB
> node 6 free: 486 MB
> node 7 cpus: 7
> node 7 size: 476 MB
> node 7 free: 471 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  40  40  40  40  40  40  40
>   1:  40  10  40  40  40  40  40  40
>   2:  40  40  10  40  40  40  40  40
>   3:  40  40  40  10  40  40  40  40
>   4:  40  40  40  40  10  40  40  40
>   5:  40  40  40  40  40  10  40  40
>   6:  40  40  40  40  40  40  10  40
>   7:  40  40  40  40  40  40  40  10
>
> root@heatpipe:~# numastat
>                            node0           node1           node2
> node3
> numa_hit                  182203           14574           23800
> 17017
> numa_miss                      0               0               0
>     0
> numa_foreign                   0               0               0
>     0
> interleave_hit              1016            1010            1051
>  1030
> local_node                180995           12906           23272
> 15338
> other_node                  1208            1668             528
>  1679
>
>                            node4           node5           node6
> node7
> numa_hit                   10621           15346            3529
>  3863
> numa_miss                      0               0               0
>     0
> numa_foreign                   0               0               0
>     0
> interleave_hit              1026            1017            1031
>  1029
> local_node                  8941           13680            1855
>  2184
> other_node                  1680            1666            1674
>  1679
>
> root@superpipe:~# xl debug-keys u
>
> (XEN) Domain 6 (total: 1024000):
> (XEN)     Node 0: 321064
> (XEN)     Node 1: 702936
> (XEN)     Domain has 8 vnodes, 8 vcpus
> (XEN)         vnode 0 - pnode 1, 500 MB, vcpu nums: 0
> (XEN)         vnode 1 - pnode 0, 500 MB, vcpu nums: 1
> (XEN)         vnode 2 - pnode 1, 500 MB, vcpu nums: 2
> (XEN)         vnode 3 - pnode 1, 500 MB, vcpu nums: 3
> (XEN)         vnode 4 - pnode 0, 500 MB, vcpu nums: 4
> (XEN)         vnode 5 - pnode 0, 1841 MB, vcpu nums: 5
> (XEN)         vnode 6 - pnode 1, 500 MB, vcpu nums: 6
> (XEN)         vnode 7 - pnode 1, 500 MB, vcpu nums: 7
>
> Current problems:
>
> Warning on CPU bringup on other node
>
>     The CPUs in the guest which belong to different NUMA nodes are configured
>     to share the same L2 cache and are thus considered to be siblings, which
>     the kernel expects to be on the same node. One can see the following
>     WARNING at boot time:
>
> [    0.022750] SMP alternatives: switching to SMP code
> [    0.004000] ------------[ cut here ]------------
> [    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303
> topology_sane.isra.8+0x67/0x79()
> [    0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node!
> [node: 1 != 0]. Ignoring dependency.
> [    0.004000] Modules linked in:
> [    0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
> [    0.004000]  0000000000000000 0000000000000009 ffffffff813df458
> ffff88007abe7e60
> [    0.004000]  ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08
> ffffffff00000100
> [    0.004000]  0000000000000001 ffff8800f6e13900 0000000000000000
> 000000000000b018
> [    0.004000] Call Trace:
> [    0.004000]  [<ffffffff813df458>] ? dump_stack+0x41/0x51
> [    0.004000]  [<ffffffff81048963>] ? warn_slowpath_common+0x78/0x90
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff81048a13>] ? warn_slowpath_fmt+0x45/0x4a
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff8102fd2e>] ? set_cpu_sibling_map+0x1c9/0x3f7
> [    0.004000]  [<ffffffff81042146>] ? numa_add_cpu+0xa/0x18
> [    0.004000]  [<ffffffff8100b4e2>] ? cpu_bringup+0x50/0x8f
> [    0.004000]  [<ffffffff8100b544>] ? cpu_bringup_and_idle+0x1d/0x28
> [    0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
> [    0.035371] x86: Booted up 2 nodes, 2 CPUs
>
> The workaround is to specify cpuid in the config file and not use SMT, but
> soon I will come up with some other acceptable solution.
>
> Incorrect amount of memory for nodes in debug-keys output
>
>     Since the per-domain node ranges are stored as guest addresses, the memory
>     calculated from them is incorrect for some nodes because of the guest e820
>     memory holes.
>
> TODO:
>     - some modifications to automatic vNUMA placement may be needed;
>     - an extended vdistance configuration parser will need to be in place;
>     - the SMT siblings problem (see above) will need a solution;
>
> Changes since v5:
>     - reorganized patches;
>     - modified domctl hypercall and added locking;
>     - added XSM hypercalls with basic policies;
>     - verified 32-bit compatibility;
>
> Elena Ufimtseva (10):
>   xen: vnuma topology and subop hypercalls
>   xsm bits for vNUMA hypercalls
>   vnuma hook to debug-keys u
>   libxc: Introduce xc_domain_setvnuma to set vNUMA
>   libxl: vnuma topology configuration parser and doc
>   libxc: move code to arch_boot_alloc func
>   libxc: allocate domain memory for vnuma enabled domains
>   libxl: build numa nodes memory blocks
>   libxl: vnuma nodes placement bits
>   libxl: set vnuma for domain
>
>  docs/man/xl.cfg.pod.5               |   77 +++++++
>  tools/libxc/xc_dom.h                |   11 +
>  tools/libxc/xc_dom_x86.c            |   71 +++++-
>  tools/libxc/xc_domain.c             |   63 ++++++
>  tools/libxc/xenctrl.h               |    9 +
>  tools/libxc/xg_private.h            |    1 +
>  tools/libxl/libxl.c                 |   22 ++
>  tools/libxl/libxl.h                 |   19 ++
>  tools/libxl/libxl_create.c          |    1 +
>  tools/libxl/libxl_dom.c             |  148 ++++++++++++
>  tools/libxl/libxl_internal.h        |   12 +
>  tools/libxl/libxl_numa.c            |  193 ++++++++++++++++
>  tools/libxl/libxl_types.idl         |    6 +-
>  tools/libxl/libxl_vnuma.h           |    8 +
>  tools/libxl/libxl_x86.c             |    3 +-
>  tools/libxl/xl_cmdimpl.c            |  425
> +++++++++++++++++++++++++++++++++++
>  xen/arch/x86/numa.c                 |   29 ++-
>  xen/common/domain.c                 |   13 ++
>  xen/common/domctl.c                 |  167 ++++++++++++++
>  xen/common/memory.c                 |   69 ++++++
>  xen/include/public/domctl.h         |   29 +++
>  xen/include/public/memory.h         |   47 +++-
>  xen/include/xen/domain.h            |   11 +
>  xen/include/xen/sched.h             |    1 +
>  xen/include/xsm/dummy.h             |    6 +
>  xen/include/xsm/xsm.h               |    7 +
>  xen/xsm/dummy.c                     |    1 +
>  xen/xsm/flask/hooks.c               |   10 +
>  xen/xsm/flask/policy/access_vectors |    4 +
>  29 files changed, 1447 insertions(+), 16 deletions(-)
>  create mode 100644 tools/libxl/libxl_vnuma.h
>
> --
> 1.7.10.4
>
>


-- 
Elena

[-- Attachment #1.2: Type: text/html, Size: 34583 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (10 preceding siblings ...)
  2014-07-18  6:16 ` [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
@ 2014-07-18  9:53 ` Wei Liu
  2014-07-18 10:13   ` Dario Faggioli
  2014-07-22 12:49 ` Dario Faggioli
  12 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-07-18  9:53 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

Hi! Another new series!

On Fri, Jul 18, 2014 at 01:49:59AM -0400, Elena Ufimtseva wrote:
[...]
> Current problems:
> 
> Warning on CPU bringup on other node
> 
>     The CPUs in the guest which belong to different NUMA nodes are configured
>     to share the same L2 cache and are thus considered to be siblings, which
>     the kernel expects to be on the same node. One can see the following
>     WARNING at boot time:
> 
> [    0.022750] SMP alternatives: switching to SMP code
> [    0.004000] ------------[ cut here ]------------
> [    0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:303 topology_sane.isra.8+0x67/0x79()
> [    0.004000] sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> [    0.004000] Modules linked in:
> [    0.004000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc8+ #43
> [    0.004000]  0000000000000000 0000000000000009 ffffffff813df458 ffff88007abe7e60
> [    0.004000]  ffffffff81048963 ffff88007abe7e70 ffffffff8102fb08 ffffffff00000100
> [    0.004000]  0000000000000001 ffff8800f6e13900 0000000000000000 000000000000b018
> [    0.004000] Call Trace:
> [    0.004000]  [<ffffffff813df458>] ? dump_stack+0x41/0x51
> [    0.004000]  [<ffffffff81048963>] ? warn_slowpath_common+0x78/0x90
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff81048a13>] ? warn_slowpath_fmt+0x45/0x4a
> [    0.004000]  [<ffffffff8102fb08>] ? topology_sane.isra.8+0x67/0x79
> [    0.004000]  [<ffffffff8102fd2e>] ? set_cpu_sibling_map+0x1c9/0x3f7
> [    0.004000]  [<ffffffff81042146>] ? numa_add_cpu+0xa/0x18
> [    0.004000]  [<ffffffff8100b4e2>] ? cpu_bringup+0x50/0x8f
> [    0.004000]  [<ffffffff8100b544>] ? cpu_bringup_and_idle+0x1d/0x28
> [    0.004000] ---[ end trace 0e2e2fd5c7b76da5 ]---
> [    0.035371] x86: Booted up 2 nodes, 2 CPUs
> 
> The workaround is to specify cpuid in the config file and not use SMT, but
> soon I will come up with some other acceptable solution.
> 

I've also encountered this. I suspect that even if you disable SMT with
cpuid in the config file, the CPU topology in the guest might still be wrong.
What do hwloc-ls and lscpu show? Do you see any weird topology, like one
core belonging to one node while three belong to another? (I suspect not,
because your vcpus are already pinned to a specific node.)

What I did was to manipulate various "id"s in the Linux kernel, so that I
create a 1 core : 1 CPU : 1 socket topology. In that case the guest
scheduler won't be able to make any assumption about individual CPUs
sharing caches with each other.

In any case, we already manipulate various IDs of CPU0; I don't see any
harm in manipulating the other CPUs as well.

Thoughts?

P.S. I'm benchmarking your v5; tell me if you're interested in the
results.

Wei.

(This patch should be applied to Linux, and it's by no means suitable for
upstream as is.)
---8<---
>From be2b33088e521284c27d6a7679b652b688dba83d Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Tue, 17 Jun 2014 14:51:57 +0100
Subject: [PATCH] XXX: CPU topology hack!

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
 arch/x86/xen/smp.c   |   17 +++++++++++++++++
 arch/x86/xen/vnuma.c |    2 ++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7005974..89656fe 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -81,6 +81,15 @@ static void cpu_bringup(void)
 	cpu = smp_processor_id();
 	smp_store_cpu_info(cpu);
 	cpu_data(cpu).x86_max_cores = 1;
+	cpu_physical_id(cpu) = cpu;
+	cpu_data(cpu).phys_proc_id = cpu;
+	cpu_data(cpu).cpu_core_id = cpu;
+	cpu_data(cpu).initial_apicid = cpu;
+	cpu_data(cpu).apicid = cpu;
+	per_cpu(cpu_llc_id, cpu) = cpu;
+	if (numa_cpu_node(cpu) != NUMA_NO_NODE)
+		cpu_data(cpu).phys_proc_id = numa_cpu_node(cpu);
+
 	set_cpu_sibling_map(cpu);
 
 	xen_setup_cpu_clockevents();
@@ -326,6 +335,14 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 
 	smp_store_boot_cpu_info();
 	cpu_data(0).x86_max_cores = 1;
+	cpu_physical_id(0) = 0;
+	cpu_data(0).phys_proc_id = 0;
+	cpu_data(0).cpu_core_id = 0;
+	per_cpu(cpu_llc_id, cpu) = 0;
+	cpu_data(0).initial_apicid = 0;
+	cpu_data(0).apicid = 0;
+	if (numa_cpu_node(0) != NUMA_NO_NODE)
+		per_cpu(x86_cpu_to_node_map, 0) = numa_cpu_node(0);
 
 	for_each_possible_cpu(i) {
 		zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
diff --git a/arch/x86/xen/vnuma.c b/arch/x86/xen/vnuma.c
index a02f9c6..418ced2 100644
--- a/arch/x86/xen/vnuma.c
+++ b/arch/x86/xen/vnuma.c
@@ -81,7 +81,9 @@ int __init xen_numa_init(void)
 	setup_nr_node_ids();
 	/* Setting the cpu, apicid to node */
 	for_each_cpu(cpu, cpu_possible_mask) {
+		/* Use cpu id as apicid */
 		set_apicid_to_node(cpu, cpu_to_node[cpu]);
+		cpu_data(cpu).initial_apicid = cpu;
 		numa_set_node(cpu, cpu_to_node[cpu]);
 		cpumask_set_cpu(cpu, node_to_cpumask_map[cpu_to_node[cpu]]);
 	}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18  9:53 ` Wei Liu
@ 2014-07-18 10:13   ` Dario Faggioli
  2014-07-18 11:48     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-07-18 10:13 UTC (permalink / raw)
  To: Wei Liu
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Elena Ufimtseva


[-- Attachment #1.1: Type: text/plain, Size: 2585 bytes --]

On ven, 2014-07-18 at 10:53 +0100, Wei Liu wrote:
> Hi! Another new series!
> 
:-)

> On Fri, Jul 18, 2014 at 01:49:59AM -0400, Elena Ufimtseva wrote:

> > The workaround is to specify cpuid in config file and not use SMT. But soon I will come up
> > with some other acceptable solution.
> > 
> 
For Elena: a workaround like what?

> I've also encountered this. I suspect that even if you disable SMT with
> cpuid in the config file, the CPU topology in the guest might still be wrong.
>
Can I ask why?

> What do hwloc-ls and lscpu show? Do you see any weird topology, like one
> core belonging to one node while three belong to another?
>
Yep, that would be interesting to see.

>  (I suspect not
> because your vcpus are already pinned to a specific node)
> 
Sorry, I'm not sure I follow here... Are you saying that things probably
work ok, but that this is (only) because of the pinning?

I may be missing something here, but would it be possible to at least
try to make sure that the virtual topology and the topology-related
content of CPUID actually agree? I mean doing it automatically (if
only one of the two is specified) and either erroring out or warning if
that is not possible (if both are specified and they disagree).

I admit I'm not a CPUID expert, but I always thought this could be a
good solution...
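
Just to make the idea concrete (a sketch only -- I'm assuming xl's cpuid
list syntax here and "htt" as the feature name, so treat this as a
hypothetical example rather than a tested recipe): something like

    # start from the host's CPUID policy and clear the HTT bit
    cpuid = [ "host,htt=0" ]

in the guest config is the kind of thing I mean by dealing with CPUID in
xl, ideally derived automatically from the requested virtual topology.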

> What I did was to manipulate various "id"s in Linux kernel, so that I
> create a topology like 1 core : 1 cpu : 1 socket mapping. 
>
And how does this topology map to / interact with the virtual topology we
want the guest to have?

> In that case
> guest scheduler won't be able to make any assumption on individual CPU
> sharing caches with each other.
> 
And, apart from SMT, what topology does the guest see then?

In any case, if this only alters SMT-ness (where "alter" = "disable"), I
think that is fine too. What I'm failing to see is whether and why this
approach is more powerful than manipulating CPUID from the config file.

I'm insisting because, if they were equivalent in terms of results, I
think it would be easier, cleaner and more correct to deal with CPUID in
xl and libxl (automatically or semi-automatically).

> P.S. I'm benchmarking your v5, tell me if you're interested in the
> result.
> 
wow, cool! I guess we all are! :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
@ 2014-07-18 10:30   ` Wei Liu
  2014-07-20 13:16     ` Elena Ufimtseva
  2014-07-18 13:49   ` Konrad Rzeszutek Wilk
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-07-18 10:30 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

On Fri, Jul 18, 2014 at 01:50:00AM -0400, Elena Ufimtseva wrote:
[...]
> +/*
> + * Allocate memory and construct one vNUMA node,
> + * set default parameters, assign all memory and
> + * vcpus to this node, set distance to 10.
> + */
> +static long vnuma_fallback(const struct domain *d,
> +                          struct vnuma_info **vnuma)
> +{
> +    struct vnuma_info *v;
> +    long ret;
> +
> +
> +    /* Will not destroy vNUMA here, destroy before calling this. */
> +    if ( vnuma && *vnuma )
> +        return -EINVAL;
> +
> +    v = *vnuma;
> +    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
> +    if ( ret )
> +        return ret;
> +
> +    v->vmemrange[0].start = 0;
> +    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;
> +    v->vdistance[0] = 10;
> +    v->vnode_to_pnode[0] = NUMA_NO_NODE;
> +    memset(v->vcpu_to_vnode, 0, d->max_vcpus);
> +    v->nr_vnodes = 1;
> +
> +    *vnuma = v;
> +
> +    return 0;
> +}
> +

I have a question about this strategy. Is there any reason to choose to
fall back to this single node? In that case the toolstack will have a
different view of the guest than the hypervisor: the toolstack still
thinks the guest has several nodes while the guest has only one. That can
cause problems when migrating a guest. Consider this: the toolstack on the
remote end still builds two nodes, because that is what it knows about,
and then the guest, which originally had one node, notices the change in
the underlying memory topology and crashes.

IMHO we should just fail in this case. It's not that common to fail a
small array allocation anyway. This approach also saves you from
writing this function. :-)
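
Roughly what I have in mind (a sketch only, reusing the names from the
error path further down in this patch; the exact error code is
illustrative):

    /* Instead of faking a single-node topology, just propagate the error. */
 vnuma_fail:
    vnuma_destroy(v);
    return -EFAULT;   /* or whatever error code brought us here */

i.e. drop vnuma_fallback() entirely and let the toolstack see the failure.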

> +/*
> + * construct vNUMA topology form u_vnuma struct and return
> + * it in dst.
> + */
[...]
> +
> +    /* On failure, set only one vNUMA node and its success. */
> +    ret = 0;
> +
> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
> +        d->max_vcpus) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
> +        nr_vnodes) )
> +        goto vnuma_onenode;
> +
> +    v->nr_vnodes = nr_vnodes;
> +    *dst = v;
> +
> +    return ret;
> +
> +vnuma_onenode:
> +    vnuma_destroy(v);
> +    return vnuma_fallback(d, dst);
> +}
> +
>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  {
>      long ret = 0;
> @@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>      }
>      break;
>  
[...]
> +/*
> + * vNUMA topology specifies vNUMA node number, distance table,
> + * memory ranges and vcpu mapping provided for guests.
> + * XENMEM_get_vnumainfo hypercall expects to see from guest
> + * nr_vnodes and nr_vcpus to indicate available memory. After
> + * filling guests structures, nr_vnodes and nr_vcpus copied
> + * back to guest.
> + */
> +struct vnuma_topology_info {
> +    /* IN */
> +    domid_t domid;
> +    /* IN/OUT */
> +    unsigned int nr_vnodes;
> +    unsigned int nr_vcpus;
> +    /* OUT */
> +    union {
> +        XEN_GUEST_HANDLE(uint) h;
> +        uint64_t pad;
> +    } vdistance;
> +    union {
> +        XEN_GUEST_HANDLE(uint) h;
> +        uint64_t pad;
> +    } vcpu_to_vnode;
> +    union {
> +        XEN_GUEST_HANDLE(vmemrange_t) h;
> +        uint64_t pad;
> +    } vmemrange;

Why do you need to use union? The other interface you introduce in this
patch doesn't use union.

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA
  2014-07-18  5:50 ` [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA Elena Ufimtseva
@ 2014-07-18 10:33   ` Wei Liu
  2014-07-29 10:33   ` Ian Campbell
  1 sibling, 0 replies; 63+ messages in thread
From: Wei Liu @ 2014-07-18 10:33 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

On Fri, Jul 18, 2014 at 01:50:03AM -0400, Elena Ufimtseva wrote:
[...]
> +int xc_domain_setvnuma(xc_interface *xch,
> +                        uint32_t domid,
> +                        uint16_t nr_vnodes,
> +                        uint16_t nr_vcpus,
> +                        vmemrange_t *vmemrange,
> +                        unsigned int *vdistance,
> +                        unsigned int *vcpu_to_vnode,
> +                        unsigned int *vnode_to_pnode)
> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
> +                                    nr_vnodes * nr_vnodes,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) *
> +                                    nr_vnodes,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);

Indentation.

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc
  2014-07-18  5:50 ` [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc Elena Ufimtseva
@ 2014-07-18 10:53   ` Wei Liu
  2014-07-20 14:04     ` Elena Ufimtseva
  2014-07-29 10:38   ` Ian Campbell
  2014-07-29 10:42   ` Ian Campbell
  2 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-07-18 10:53 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

On Fri, Jul 18, 2014 at 01:50:04AM -0400, Elena Ufimtseva wrote:
> Parses the vnuma topology: number of nodes and memory
> ranges. If not defined, initializes vnuma with
> only one node and a default topology. This one node covers
> all domain memory and has all vcpus assigned to it.
> 
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  docs/man/xl.cfg.pod.5       |   77 ++++++++
>  tools/libxl/libxl_types.idl |    6 +-
>  tools/libxl/libxl_vnuma.h   |    8 +
>  tools/libxl/xl_cmdimpl.c    |  425 +++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 515 insertions(+), 1 deletion(-)
>  create mode 100644 tools/libxl/libxl_vnuma.h
> 
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index ff9ea77..0c7fbf8 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -242,6 +242,83 @@ if the values of B<memory=> and B<maxmem=> differ.
>  A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver
>  it will crash.
>  
> +=item B<vnuma_nodes=N>
> +
> +Number of vNUMA nodes the guest will be initialized with on boot.
> +PV guest by default will have one vnuma node.
> +

In the future, all these config options will be used for HVM / PVH
guests as well. But I'm fine with leaving it as is for the moment.

[...]
>  
>  =head3 Event Actions
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index de25f42..5876822 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -318,7 +318,11 @@ libxl_domain_build_info = Struct("domain_build_info",[
>      ("disable_migrate", libxl_defbool),
>      ("cpuid",           libxl_cpuid_policy_list),
>      ("blkdev_start",    string),
> -    
> +    ("vnuma_mem",     Array(uint64, "nr_nodes")),
> +    ("vnuma_vcpumap",     Array(uint32, "nr_nodemap")),
> +    ("vdistance",        Array(uint32, "nr_dist")),
> +    ("vnuma_vnodemap",  Array(uint32, "nr_node_to_pnode")),

The main problem here is that we need to name counter variables
num_VARs. See idl.txt, idl.Array section.
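
I.e. something like this (illustrative names only, just following the
num_* convention from idl.txt):

    ("vnuma_mem",     Array(uint64, "num_vnuma_mem")),
    ("vnuma_vcpumap", Array(uint32, "num_vnuma_vcpumap")),

and similarly for the other two arrays.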

> +    ("vnuma_autoplacement",  libxl_defbool),
>      ("device_model_version", libxl_device_model_version),
>      ("device_model_stubdomain", libxl_defbool),
>      # if you set device_model you must set device_model_version too
> diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
> new file mode 100644
> index 0000000..4ff4c57
> --- /dev/null
> +++ b/tools/libxl/libxl_vnuma.h
> @@ -0,0 +1,8 @@
> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#define VNUMA_NO_NODE ~((unsigned int)0)
> +
> +/* Max vNUMA node size from Linux. */

Should be "Min" I guess.

> +#define MIN_VNODE_SIZE  32U
> +
> +#define MAX_VNUMA_NODES (unsigned int)1 << 10

Does this also come from Linux? Or is it just some arbitrary number?
Worth a comment here.

> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 68df548..5d91c2c 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -40,6 +40,7 @@
>  #include "libxl_json.h"
>  #include "libxlutil.h"
>  #include "xl.h"
> +#include "libxl_vnuma.h"
>  
>  /* For calls which return an errno on failure */
>  #define CHK_ERRNOVAL( call ) ({                                         \
> @@ -690,6 +691,423 @@ static void parse_top_level_sdl_options(XLU_Config *config,
>      xlu_cfg_replace_string (config, "xauthority", &sdl->xauthority, 0);
>  }
>  
> +
> +static unsigned int get_list_item_uint(XLU_ConfigList *list, unsigned int i)
> +{
> +    const char *buf;
> +    char *ep;
> +    unsigned long ul;
> +    int rc = -EINVAL;
> +
> +    buf = xlu_cfg_get_listitem(list, i);
> +    if (!buf)
> +        return rc;
> +    ul = strtoul(buf, &ep, 10);
> +    if (ep == buf)
> +        return rc;
> +    if (ul >= UINT16_MAX)
> +        return rc;
> +    return (unsigned int)ul;
> +}
> +
> +static void vdistance_set(unsigned int *vdistance,
> +                                unsigned int nr_vnodes,
> +                                unsigned int samenode,
> +                                unsigned int othernode)
> +{
> +    unsigned int idx, slot;
> +    for (idx = 0; idx < nr_vnodes; idx++)
> +        for (slot = 0; slot < nr_vnodes; slot++)
> +            *(vdistance + slot * nr_vnodes + idx) =
> +                idx == slot ? samenode : othernode;
> +}
> +
> +static void vcputovnode_default(unsigned int *cpu_to_node,
> +                                unsigned int nr_vnodes,
> +                                unsigned int max_vcpus)
> +{
> +    unsigned int cpu;
> +    for (cpu = 0; cpu < max_vcpus; cpu++)
> +        cpu_to_node[cpu] = cpu % nr_vnodes;
> +}
> +
> +/* Split domain memory between vNUMA nodes equally. */
> +static int split_vnumamem(libxl_domain_build_info *b_info)
> +{
> +    unsigned long long vnodemem = 0;
> +    unsigned long n;
> +    unsigned int i;
> +
> +    if (b_info->nr_nodes == 0)
> +        return -1;
> +
> +    vnodemem = (b_info->max_memkb >> 10) / b_info->nr_nodes;
> +    if (vnodemem < MIN_VNODE_SIZE)
> +        return -1;
> +    /* reminder in MBytes. */
> +    n = (b_info->max_memkb >> 10) % b_info->nr_nodes;
> +    /* get final sizes in MBytes. */
> +    for (i = 0; i < (b_info->nr_nodes - 1); i++)
> +        b_info->vnuma_mem[i] = vnodemem;
> +    /* add the reminder to the last node. */
> +    b_info->vnuma_mem[i] = vnodemem + n;
> +    return 0;
> +}
> +
> +static void vnuma_vnodemap_default(unsigned int *vnuma_vnodemap,
> +                                   unsigned int nr_vnodes)
> +{
> +    unsigned int i;
> +    for (i = 0; i < nr_vnodes; i++)
> +        vnuma_vnodemap[i] = VNUMA_NO_NODE;
> +}
> +
> +/*
> + * init vNUMA to "zero config" with one node and all other
> + * topology parameters set to default.
> + */
> +static int vnuma_zero_config(libxl_domain_build_info *b_info)
> +{

I haven't looked into the details of this function, but from reading the
comment I think it should be renamed to vnuma_default_config.

> +    b_info->nr_nodes = 1;
> +    /* all memory goes to this one vnode, as well as vcpus. */
> +    if (!(b_info->vnuma_mem = (uint64_t *)calloc(b_info->nr_nodes,
> +                                sizeof(*b_info->vnuma_mem))))
> +        goto bad_vnumazerocfg;
> +
> +    if (!(b_info->vnuma_vcpumap = (unsigned int *)calloc(b_info->max_vcpus,
> +                                sizeof(*b_info->vnuma_vcpumap))))
> +        goto bad_vnumazerocfg;
> +
> +    if (!(b_info->vdistance = (unsigned int *)calloc(b_info->nr_nodes *
> +                                b_info->nr_nodes, sizeof(*b_info->vdistance))))
> +        goto bad_vnumazerocfg;
> +
> +    if (!(b_info->vnuma_vnodemap = (unsigned int *)calloc(b_info->nr_nodes,
> +                                sizeof(*b_info->vnuma_vnodemap))))
> +        goto bad_vnumazerocfg;
> +
> +    b_info->vnuma_mem[0] = b_info->max_memkb >> 10;
> +
> +    /* all vcpus assigned to this vnode. */
> +    vcputovnode_default(b_info->vnuma_vcpumap, b_info->nr_nodes,
> +                        b_info->max_vcpus);
> +
> +    /* default vdistance is 10. */
> +    vdistance_set(b_info->vdistance, b_info->nr_nodes, 10, 10);
> +
> +    /* VNUMA_NO_NODE for vnode_to_pnode. */
> +    vnuma_vnodemap_default(b_info->vnuma_vnodemap, b_info->nr_nodes);
> +
> +    /*
> +     * will be placed to some physical nodes defined by automatic
> +     * numa placement or VNUMA_NO_NODE will not request exact node.
> +     */
> +    libxl_defbool_set(&b_info->vnuma_autoplacement, true);
> +    return 0;
> +
> + bad_vnumazerocfg:
> +    return -1;
> +}
> +
> +static void free_vnuma_info(libxl_domain_build_info *b_info)
> +{
> +    free(b_info->vnuma_mem);
> +    free(b_info->vdistance);
> +    free(b_info->vnuma_vcpumap);
> +    free(b_info->vnuma_vnodemap);
> +    b_info->nr_nodes = 0;
> +}
> +
> +static int parse_vnuma_mem(XLU_Config *config,
> +                            libxl_domain_build_info **b_info)
> +{
> +    libxl_domain_build_info *dst;
> +    XLU_ConfigList *vnumamemcfg;
> +    int nr_vnuma_regions, i;
> +    unsigned long long vnuma_memparsed = 0;
> +    unsigned long ul;
> +    const char *buf;
> +
> +    dst = *b_info;
> +    if (!xlu_cfg_get_list(config, "vnuma_mem",
> +                          &vnumamemcfg, &nr_vnuma_regions, 0)) {
> +
> +        if (nr_vnuma_regions != dst->nr_nodes) {
> +            fprintf(stderr, "Number of numa regions (vnumamem = %d) is \
> +                    incorrect (should be %d).\n", nr_vnuma_regions,
> +                    dst->nr_nodes);
> +            goto bad_vnuma_mem;
> +        }
> +
> +        dst->vnuma_mem = calloc(dst->nr_nodes,
> +                                 sizeof(*dst->vnuma_mem));
> +        if (dst->vnuma_mem == NULL) {
> +            fprintf(stderr, "Unable to allocate memory for vnuma ranges.\n");
> +            goto bad_vnuma_mem;
> +        }
> +
> +        char *ep;
> +        /*
> +         * Will parse only nr_vnodes times, even if we have more/less regions.
> +         * Take care of it later if less or discard if too many regions.
> +         */
> +        for (i = 0; i < dst->nr_nodes; i++) {
> +            buf = xlu_cfg_get_listitem(vnumamemcfg, i);
> +            if (!buf) {
> +                fprintf(stderr,
> +                        "xl: Unable to get element %d in vnuma memory list.\n", i);
> +                if (vnuma_zero_config(dst))
> +                    goto bad_vnuma_mem;
> +
> +            }

I think we should fail here instead of creating a "default" config. See
the reasoning I gave on the hypervisor side.

> +            ul = strtoul(buf, &ep, 10);
> +            if (ep == buf) {
> +                fprintf(stderr, "xl: Invalid argument parsing vnumamem: %s.\n", buf);
> +                if (vnuma_zero_config(dst))
> +                    goto bad_vnuma_mem;
> +            }
> +

Ditto.

> +            /* 32Mb is a min size for a node, taken from Linux */
> +            if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) {
> +                fprintf(stderr, "xl: vnuma memory %lu is not within %u - %u range.\n",
> +                        ul, MIN_VNODE_SIZE, UINT32_MAX);
> +                if (vnuma_zero_config(dst))
> +                    goto bad_vnuma_mem;
> +            }
> +

Ditto.

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-07-18  5:50 ` [PATCH v6 10/10] libxl: set vnuma for domain Elena Ufimtseva
@ 2014-07-18 10:58   ` Wei Liu
  2014-07-29 10:45   ` Ian Campbell
  1 sibling, 0 replies; 63+ messages in thread
From: Wei Liu @ 2014-07-18 10:58 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

On Fri, Jul 18, 2014 at 01:50:09AM -0400, Elena Ufimtseva wrote:
[...]
>  int libxl_fd_set_cloexec(libxl_ctx *ctx, int fd, int cloexec);
>  int libxl_fd_set_nonblock(libxl_ctx *ctx, int fd, int nonblock);
>  
> +int libxl_domain_setvnuma(libxl_ctx *ctx,
> +                           uint32_t domid,
> +                           uint16_t nr_vnodes,
> +                           uint16_t nr_vcpus,
> +                           vmemrange_t *vmemrange,
> +                           unsigned int *vdistance,
> +                           unsigned int *vcpu_to_vnode,
> +                           unsigned int *vnode_to_pnode);
> +
>  #include <libxl_event.h>
>  

You will need to add
  #define LIBXL_HAVE_DOMAIN_SETVNUMA 1
to advertise the introduction of new API.

Wei.

>  #endif /* LIBXL_H */
> -- 
> 1.7.10.4
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 08/10] libxl: build numa nodes memory blocks
  2014-07-18  5:50 ` [PATCH v6 08/10] libxl: build numa nodes memory blocks Elena Ufimtseva
@ 2014-07-18 11:01   ` Wei Liu
  2014-07-20 12:58     ` Elena Ufimtseva
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-07-18 11:01 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

On Fri, Jul 18, 2014 at 01:50:07AM -0400, Elena Ufimtseva wrote:
[...]
>  
> +bool libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info);
> +
> +int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> +                                  uint32_t *nr_entries,
> +                                  unsigned long map_limitkb,
> +                                  unsigned long balloon_kb);
> +

e820_sanitize should not take a ctx? It's an internal function anyway.
But this is not your fault, so don't worry about it.

Also, this function seems to be arch-specific, so I wonder if there's a
better place for it.

> +int libxl__vnuma_align_mem(libxl__gc *gc,
> +                                     uint32_t domid,
> +                                     struct libxl_domain_build_info *b_info,
> +                                     vmemrange_t *memblks);
> +

Indentation.

> +/*
> + * For each node, build memory block start and end addresses.
> + * Subtract any memory hole from the range found in the e820 map.
> + * vnode memory sizes are passed here in megabytes, the result is
> + * in memory block addresses.
> + * Linux kernel will adjust numa memory block sizes on its own.
> + * But we want to provide to the kernel numa block addresses that
> + * will be the same in kernel and hypervisor.
> + */
> +#define max(a,b) ((a > b) ? a : b)

You don't need to redefine max; I think we have this somewhere already.

(I haven't looked into the placement strategy; I guess you, Dario and
George have come to an agreement on how this should be implemented.)

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18 10:13   ` Dario Faggioli
@ 2014-07-18 11:48     ` Wei Liu
  2014-07-20 14:57       ` Elena Ufimtseva
  2014-07-22 14:03       ` Dario Faggioli
  0 siblings, 2 replies; 63+ messages in thread
From: Wei Liu @ 2014-07-18 11:48 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Elena Ufimtseva,
	Wei Liu

On Fri, Jul 18, 2014 at 12:13:36PM +0200, Dario Faggioli wrote:
> On ven, 2014-07-18 at 10:53 +0100, Wei Liu wrote:
> > Hi! Another new series!
> > 
> :-)
> 
> > On Fri, Jul 18, 2014 at 01:49:59AM -0400, Elena Ufimtseva wrote:
> 
> > > The workaround is to specify cpuid in config file and not use SMT. But soon I will come up
> > > with some other acceptable solution.
> > > 
> > 
> For Elena: a workaround like what?
> 
> > I've also encountered this. I suspect that even if you disable SMT with
> > cpuid in the config file, the cpu topology in the guest might still be wrong.
> >
> Can I ask why?
> 

Because for a PV guest (currently) the guest kernel sees the real "ID"s
for a cpu. See those "ID"s I change in my hacky patch.

> > What do hwloc-ls and lscpu show? Do you see any weird topology like one
> > core belongs to one node while three belong to another?
> >
> Yep, that would be interesting to see.
> 
> >  (I suspect not
> > because your vcpus are already pinned to a specific node)
> > 
> Sorry, I'm not sure I follow here... Are you saying that things probably
> work ok, but that is (only) because of pinning?

Yes, given that you derive NUMA memory allocation from cpu pinning, or
use a combination of cpu pinning, vcpu-to-vnode map and vnode-to-pnode
map, in those cases the IDs might reflect the right topology.

> 
> I may be missing something here, but would it be possible to at least
> try to make sure that the virtual topology and the topology related
> content of CPUID actually agree? And I mean doing it automatically (if

This is what I'm doing in my hack. :-)

> only one of the two is specified) and to either error or warn if that is
> not possible (if both are specified and they disagree)?
> 
> I admit I'm not a CPUID expert, but I always thought this could be a
> good solution...
> 
> > What I did was to manipulate various "id"s in Linux kernel, so that I
> > create a topology like 1 core : 1 cpu : 1 socket mapping. 
> >
> And how does this topology map/interact with the virtual topology we want
> the guest to have?
> 

Say you have a two-node guest with 4 vcpus: you now have two sockets
per node, each socket has one cpu, and each cpu has one core.

Node 0:
  Socket 0:
    CPU0:
      Core 0
  Socket 1:
    CPU 1:
      Core 1
Node 1:
  Socket 2:
    CPU 2:
      Core 2
  Socket 3:
    CPU 3:
      Core 3
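Purely as an illustration of that flat layout (a hypothetical helper,
not code from the series, and it assumes nr_vcpus is a multiple of
nr_vnodes):

  struct vtopo_id {
      unsigned int node, socket, core;
  };

  /* Flat 1 core : 1 cpu : 1 socket mapping: every vcpu is its own
   * socket with a single core; nodes get contiguous vcpu ranges. */
  static struct vtopo_id flat_topology(unsigned int vcpu,
                                       unsigned int nr_vcpus,
                                       unsigned int nr_vnodes)
  {
      struct vtopo_id id;

      id.node   = vcpu / (nr_vcpus / nr_vnodes); /* 4 vcpus, 2 nodes -> 0,0,1,1 */
      id.socket = vcpu;                          /* one socket per vcpu */
      id.core   = vcpu;                          /* one core per socket */

      return id;
  }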

> > In that case
> > the guest scheduler won't be able to make any assumptions about individual
> > CPUs sharing caches with each other.
> > 
> And, apart from SMT, what topology does the guest see then?
> 

See above.

> In any case, if this only alters SMT-ness (where "alter"="disable"), I
> think that is fine too. What I'm failing to see is whether and why
> this approach is more powerful than manipulating CPUID from the config file.
> 
> I'm insisting because, if they'd be equivalent, in terms of results, I
> think it's easier, cleaner and more correct to deal with CPUID in xl and
> libxl (automatically or semi-automatically).
> 

SMT is just one aspect of the story that easily surfaces.

In my opinion, if we don't manually create some kind of topology for the
guest, the guest might end up with something weird. For example, if you
have a 2-node, 4-socket, 8-cpu, 8-core system, you might have

Node 0:
  Socket 0
    CPU0
  Socket 1
    CPU1
Node 1:
  Socket 2
    CPU 3
    CPU 4

which all stems from the guest having knowledge of the real CPU "ID"s.

And this topology is just wrong; it might only be valid at the time of
guest creation. Xen is free to schedule vcpus on different pcpus, so the
guest scheduler will make wrong decisions based on erroneous information.

That's why I chose the 1 core : 1 cpu : 1 socket mapping, so that the
guest makes no assumptions about cache sharing etc. It's suboptimal but
should provide predictable average performance. What do you think?

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
  2014-07-18 10:30   ` Wei Liu
@ 2014-07-18 13:49   ` Konrad Rzeszutek Wilk
  2014-07-20 13:26     ` Elena Ufimtseva
  2014-07-22 15:14   ` Dario Faggioli
  2014-07-23 14:06   ` Jan Beulich
  3 siblings, 1 reply; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-07-18 13:49 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, Jul 18, 2014 at 01:50:00AM -0400, Elena Ufimtseva wrote:
> Define interface, structures and hypercalls for toolstack to
> build vnuma topology and for guests that wish to retrieve it.
> Two subop hypercalls introduced by patch:
> XEN_DOMCTL_setvnumainfo to define vNUMA domain topology per domain
> and XENMEM_get_vnumainfo to retrieve that topology by guest.
> 
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  xen/common/domain.c         |   13 ++++
>  xen/common/domctl.c         |  167 +++++++++++++++++++++++++++++++++++++++++++
>  xen/common/memory.c         |   62 ++++++++++++++++
>  xen/include/public/domctl.h |   29 ++++++++
>  xen/include/public/memory.h |   47 +++++++++++-
>  xen/include/xen/domain.h    |   11 +++
>  xen/include/xen/sched.h     |    1 +
>  7 files changed, 329 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index cd64aea..895584a 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -584,6 +584,18 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
>      return 0;
>  }
>  
> +void vnuma_destroy(struct vnuma_info *vnuma)
> +{
> +    if ( vnuma )
> +    {
> +        xfree(vnuma->vmemrange);
> +        xfree(vnuma->vcpu_to_vnode);
> +        xfree(vnuma->vdistance);
> +        xfree(vnuma->vnode_to_pnode);
> +        xfree(vnuma);
> +    }
> +}
> +
>  int domain_kill(struct domain *d)
>  {
>      int rc = 0;
> @@ -602,6 +614,7 @@ int domain_kill(struct domain *d)
>          evtchn_destroy(d);
>          gnttab_release_mappings(d);
>          tmem_destroy(d->tmem_client);
> +        vnuma_destroy(d->vnuma);
>          domain_set_outstanding_pages(d, 0);
>          d->tmem_client = NULL;
>          /* fallthrough */
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index c326aba..7464284 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -297,6 +297,144 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
>              guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
>  }
>  
> +/*
> + * Allocates memory for vNUMA, **vnuma should be NULL.
> + * Caller has to make sure that domain has max_pages
> + * and number of vcpus set for domain.
> + * Verifies that single allocation does not exceed
> + * PAGE_SIZE.
> + */
> +static int vnuma_alloc(struct vnuma_info **vnuma,
> +                       unsigned int nr_vnodes,
> +                       unsigned int nr_vcpus,
> +                       unsigned int dist_size)
> +{
> +    struct vnuma_info *v;
> +
> +    if ( vnuma && *vnuma )
> +        return -EINVAL;
> +
> +    v = *vnuma;
> +    /*
> +     * check if any of xmallocs exeeds PAGE_SIZE.
> +     * If yes, consider it as an error for now.
> +     */
> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
> +        dist_size > PAGE_SIZE / sizeof(dist_size) )
> +        return -EINVAL;
> +
> +    v = xzalloc(struct vnuma_info);
> +    if ( !v )
> +        return -ENOMEM;
> +
> +    v->vdistance = xmalloc_array(unsigned int, dist_size);
> +    v->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
> +    v->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
> +    v->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
> +
> +    if ( v->vdistance == NULL || v->vmemrange == NULL ||
> +        v->vcpu_to_vnode == NULL || v->vnode_to_pnode == NULL )
> +    {
> +        vnuma_destroy(v);
> +        return -ENOMEM;
> +    }
> +
> +    *vnuma = v;
> +
> +    return 0;
> +}
> +
> +/*
> + * Allocate memory and construct one vNUMA node,
> + * set default parameters, assign all memory and
> + * vcpus to this node, set distance to 10.
> + */
> +static long vnuma_fallback(const struct domain *d,
> +                          struct vnuma_info **vnuma)
> +{
> +    struct vnuma_info *v;
> +    long ret;
> +
> +
> +    /* Will not destroy vNUMA here, destroy before calling this. */
> +    if ( vnuma && *vnuma )
> +        return -EINVAL;
> +
> +    v = *vnuma;
> +    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
> +    if ( ret )
> +        return ret;
> +
> +    v->vmemrange[0].start = 0;
> +    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;
> +    v->vdistance[0] = 10;
> +    v->vnode_to_pnode[0] = NUMA_NO_NODE;
> +    memset(v->vcpu_to_vnode, 0, d->max_vcpus);
> +    v->nr_vnodes = 1;
> +
> +    *vnuma = v;
> +
> +    return 0;
> +}
> +
> +/*
> + * construct vNUMA topology form u_vnuma struct and return
> + * it in dst.
> + */
> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
> +                const struct domain *d,
> +                struct vnuma_info **dst)
> +{
> +    unsigned int dist_size, nr_vnodes = 0;
> +    long ret;
> +    struct vnuma_info *v = NULL;
> +
> +    ret = -EINVAL;
> +
> +    /* If vNUMA topology already set, just exit. */
> +    if ( !u_vnuma || *dst )
> +        return ret;
> +
> +    nr_vnodes = u_vnuma->nr_vnodes;
> +
> +    if ( nr_vnodes == 0 )
> +        return ret;
> +
> +    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
> +        return ret;
> +
> +    dist_size = nr_vnodes * nr_vnodes;
> +
> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
> +    if ( ret )
> +        return ret;
> +
> +    /* On failure, set only one vNUMA node and its success. */
> +    ret = 0;
> +
> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
> +        d->max_vcpus) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
> +        nr_vnodes) )
> +        goto vnuma_onenode;
> +
> +    v->nr_vnodes = nr_vnodes;
> +    *dst = v;
> +
> +    return ret;
> +
> +vnuma_onenode:
> +    vnuma_destroy(v);
> +    return vnuma_fallback(d, dst);
> +}
> +
>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  {
>      long ret = 0;
> @@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>      }
>      break;
>  
> +    case XEN_DOMCTL_setvnumainfo:
> +    {
> +        struct vnuma_info *v = NULL;
> +
> +        ret = -EFAULT;
> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
> +            return ret;
> +
> +        ret = -EINVAL;
> +
> +        ret = vnuma_init(&op->u.vnuma, d, &v);
> +        if ( ret < 0 || v == NULL )
> +            break;
> +
> +        /* overwrite vnuma for domain */
> +        if ( !d->vnuma )

You want that within the domain_lock.

Otherwise a caller (on another CPU) could try to read
d->vnuma and blow up, say by using the serial console to
read the guest vNUMA topology.

> +            vnuma_destroy(d->vnuma);
> +
> +        domain_lock(d);

I would just do

	vnuma_destroy(d->vnuma)

here and remove the 'if' above.
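In other words, a minimal sketch of the ordering being suggested (the
vnuma_destroy() in this patch already tolerates a NULL pointer):

      domain_lock(d);
      vnuma_destroy(d->vnuma);   /* NULL-safe, so no 'if' needed */
      d->vnuma = v;
      domain_unlock(d);

      ret = 0;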
> +        d->vnuma = v;
> +        domain_unlock(d);
> +
> +        ret = 0;
> +    }
> +    break;
> +
>      default:
>          ret = arch_do_domctl(op, d, u_domctl);
>          break;
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index c2dd31b..925b9fc 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -969,6 +969,68 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>          break;
>  
> +    case XENMEM_get_vnumainfo:
> +    {
> +        struct vnuma_topology_info topology;
> +        struct domain *d;
> +        unsigned int dom_vnodes = 0;
> +
> +        /*
> +         * guest passes nr_vnodes and nr_vcpus thus
> +         * we know how much memory guest has allocated.
> +         */
> +        if ( copy_from_guest(&topology, arg, 1) ||
> +            guest_handle_is_null(topology.vmemrange.h) ||
> +            guest_handle_is_null(topology.vdistance.h) ||
> +            guest_handle_is_null(topology.vcpu_to_vnode.h) )
> +            return -EFAULT;
> +
> +        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
> +            return -ESRCH;
> +
> +        rc = -EOPNOTSUPP;
> +        if ( d->vnuma == NULL )
> +            goto vnumainfo_out;
> +
> +        if ( d->vnuma->nr_vnodes == 0 )
> +            goto vnumainfo_out;
> +
> +        dom_vnodes = d->vnuma->nr_vnodes;
> +
> +        /*
> +         * guest nr_cpus and nr_nodes may differ from domain vnuma config.
> +         * Check here guest nr_nodes and nr_cpus to make sure we dont overflow.
> +         */
> +        rc = -ENOBUFS;
> +        if ( topology.nr_vnodes < dom_vnodes ||
> +            topology.nr_vcpus < d->max_vcpus )
> +            goto vnumainfo_out;
> +
> +        rc = -EFAULT;
> +
> +        if ( copy_to_guest(topology.vmemrange.h, d->vnuma->vmemrange,
> +                           dom_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( copy_to_guest(topology.vdistance.h, d->vnuma->vdistance,
> +                           dom_vnodes * dom_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( copy_to_guest(topology.vcpu_to_vnode.h, d->vnuma->vcpu_to_vnode,
> +                           d->max_vcpus) != 0 )
> +            goto vnumainfo_out;
> +
> +        topology.nr_vnodes = dom_vnodes;
> +
> +        if ( copy_to_guest(arg, &topology, 1) != 0 )
> +            goto vnumainfo_out;
> +        rc = 0;
> +
> + vnumainfo_out:
> +        rcu_unlock_domain(d);
> +        break;
> +    }
> +
>      default:
>          rc = arch_memory_op(cmd, arg);
>          break;
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index 5b11bbf..5ee74f4 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -35,6 +35,7 @@
>  #include "xen.h"
>  #include "grant_table.h"
>  #include "hvm/save.h"
> +#include "memory.h"
>  
>  #define XEN_DOMCTL_INTERFACE_VERSION 0x0000000a
>  
> @@ -934,6 +935,32 @@ struct xen_domctl_vcpu_msrs {
>  };
>  typedef struct xen_domctl_vcpu_msrs xen_domctl_vcpu_msrs_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpu_msrs_t);
> +
> +/*
> + * Use in XEN_DOMCTL_setvnumainfo to set
> + * vNUMA domain topology.
> + */
> +struct xen_domctl_vnuma {
> +    uint32_t nr_vnodes;
> +    uint32_t _pad;
> +    XEN_GUEST_HANDLE_64(uint) vdistance;
> +    XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode;
> +
> +    /*
> +     * vnodes to physical NUMA nodes mask.
> +     * This kept on per-domain basis for
> +     * interested consumers, such as numa aware ballooning.
> +     */
> +    XEN_GUEST_HANDLE_64(uint) vnode_to_pnode;
> +
> +    /*
> +     * memory rages for each vNUMA node
> +     */
> +    XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange;
> +};
> +typedef struct xen_domctl_vnuma xen_domctl_vnuma_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t);
> +
>  #endif
>  
>  struct xen_domctl {
> @@ -1008,6 +1035,7 @@ struct xen_domctl {
>  #define XEN_DOMCTL_cacheflush                    71
>  #define XEN_DOMCTL_get_vcpu_msrs                 72
>  #define XEN_DOMCTL_set_vcpu_msrs                 73
> +#define XEN_DOMCTL_setvnumainfo                  74
>  #define XEN_DOMCTL_gdbsx_guestmemio            1000
>  #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>  #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
> @@ -1068,6 +1096,7 @@ struct xen_domctl {
>          struct xen_domctl_cacheflush        cacheflush;
>          struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
>          struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
> +        struct xen_domctl_vnuma             vnuma;
>          uint8_t                             pad[128];
>      } u;
>  };
> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> index 2c57aa0..2c212e1 100644
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -521,9 +521,54 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>   * The zero value is appropiate.
>   */
>  
> +/* vNUMA node memory range */
> +struct vmemrange {
> +    uint64_t start, end;
> +};
> +
> +typedef struct vmemrange vmemrange_t;
> +DEFINE_XEN_GUEST_HANDLE(vmemrange_t);
> +
> +/*
> + * vNUMA topology specifies vNUMA node number, distance table,
> + * memory ranges and vcpu mapping provided for guests.
> + * XENMEM_get_vnumainfo hypercall expects to see from guest
> + * nr_vnodes and nr_vcpus to indicate available memory. After
> + * filling guests structures, nr_vnodes and nr_vcpus copied
> + * back to guest.
> + */
> +struct vnuma_topology_info {
> +    /* IN */
> +    domid_t domid;
> +    /* IN/OUT */
> +    unsigned int nr_vnodes;
> +    unsigned int nr_vcpus;
> +    /* OUT */
> +    union {
> +        XEN_GUEST_HANDLE(uint) h;
> +        uint64_t pad;
> +    } vdistance;
> +    union {
> +        XEN_GUEST_HANDLE(uint) h;
> +        uint64_t pad;
> +    } vcpu_to_vnode;
> +    union {
> +        XEN_GUEST_HANDLE(vmemrange_t) h;
> +        uint64_t pad;
> +    } vmemrange;
> +};
> +typedef struct vnuma_topology_info vnuma_topology_info_t;
> +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
> +
> +/*
> + * XENMEM_get_vnumainfo used by guest to get
> + * vNUMA topology from hypervisor.
> + */
> +#define XENMEM_get_vnumainfo               26
> +
>  #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>  
> -/* Next available subop number is 26 */
> +/* Next available subop number is 27 */
>  
>  #endif /* __XEN_PUBLIC_MEMORY_H__ */
>  
> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
> index bb1c398..d29a84d 100644
> --- a/xen/include/xen/domain.h
> +++ b/xen/include/xen/domain.h
> @@ -89,4 +89,15 @@ extern unsigned int xen_processor_pmbits;
>  
>  extern bool_t opt_dom0_vcpus_pin;
>  
> +/* vnuma topology per domain. */
> +struct vnuma_info {
> +    unsigned int nr_vnodes;
> +    unsigned int *vdistance;
> +    unsigned int *vcpu_to_vnode;
> +    unsigned int *vnode_to_pnode;
> +    struct vmemrange *vmemrange;
> +};
> +
> +void vnuma_destroy(struct vnuma_info *vnuma);
> +
>  #endif /* __XEN_DOMAIN_H__ */
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index d5bc461..71e4218 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -447,6 +447,7 @@ struct domain
>      nodemask_t node_affinity;
>      unsigned int last_alloc_node;
>      spinlock_t node_affinity_lock;
> +    struct vnuma_info *vnuma;
>  };
>  
>  struct domain_setup_info
> -- 
> 1.7.10.4
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/10] xsm bits for vNUMA hypercalls
  2014-07-18  5:50 ` [PATCH v6 02/10] xsm bits for vNUMA hypercalls Elena Ufimtseva
@ 2014-07-18 13:50   ` Konrad Rzeszutek Wilk
  2014-07-18 15:26     ` Daniel De Graaf
  0 siblings, 1 reply; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-07-18 13:50 UTC (permalink / raw)
  To: Elena Ufimtseva, dgdegra
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, Jul 18, 2014 at 01:50:01AM -0400, Elena Ufimtseva wrote:
> Define xsm_get_vnumainfo hypercall used for domain which
> wish to receive vnuma topology. Add xsm hook for
> XEN_DOMCTL_setvnumainfo. Also adds basic policies.

CC-ing Daniel - the XSM maintainer.

> 
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  xen/common/memory.c                 |    7 +++++++
>  xen/include/xsm/dummy.h             |    6 ++++++
>  xen/include/xsm/xsm.h               |    7 +++++++
>  xen/xsm/dummy.c                     |    1 +
>  xen/xsm/flask/hooks.c               |   10 ++++++++++
>  xen/xsm/flask/policy/access_vectors |    4 ++++
>  6 files changed, 35 insertions(+)
> 
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 925b9fc..9a87aa8 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -988,6 +988,13 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
>              return -ESRCH;
>  
> +        rc = xsm_get_vnumainfo(XSM_PRIV, d);
> +        if ( rc )
> +        {
> +            rcu_unlock_domain(d);
> +            return rc;
> +        }
> +
>          rc = -EOPNOTSUPP;
>          if ( d->vnuma == NULL )
>              goto vnumainfo_out;
> diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
> index c5aa316..4262fd8 100644
> --- a/xen/include/xsm/dummy.h
> +++ b/xen/include/xsm/dummy.h
> @@ -317,6 +317,12 @@ static XSM_INLINE int xsm_set_pod_target(XSM_DEFAULT_ARG struct domain *d)
>      return xsm_default_action(action, current->domain, d);
>  }
>  
> +static XSM_INLINE int xsm_get_vnumainfo(XSM_DEFAULT_ARG struct domain *d)
> +{
> +    XSM_ASSERT_ACTION(XSM_PRIV);
> +    return xsm_default_action(action, current->domain, d);
> +}
> +
>  #if defined(HAS_PASSTHROUGH) && defined(HAS_PCI)
>  static XSM_INLINE int xsm_get_device_group(XSM_DEFAULT_ARG uint32_t machine_bdf)
>  {
> diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
> index a85045d..c7ec562 100644
> --- a/xen/include/xsm/xsm.h
> +++ b/xen/include/xsm/xsm.h
> @@ -169,6 +169,7 @@ struct xsm_operations {
>      int (*unbind_pt_irq) (struct domain *d, struct xen_domctl_bind_pt_irq *bind);
>      int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
>      int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
> +    int (*get_vnumainfo) (struct domain *d);
>  #endif
>  };
>  
> @@ -653,6 +654,12 @@ static inline int xsm_ioport_mapping (xsm_default_t def, struct domain *d, uint3
>  {
>      return xsm_ops->ioport_mapping(d, s, e, allow);
>  }
> +
> +static inline int xsm_get_vnumainfo (xsm_default_t def, struct domain *d)
> +{
> +    return xsm_ops->get_vnumainfo(d);
> +}
> +
>  #endif /* CONFIG_X86 */
>  
>  #endif /* XSM_NO_WRAPPERS */
> diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
> index c95c803..0826a8b 100644
> --- a/xen/xsm/dummy.c
> +++ b/xen/xsm/dummy.c
> @@ -85,6 +85,7 @@ void xsm_fixup_ops (struct xsm_operations *ops)
>      set_to_dummy_if_null(ops, iomem_permission);
>      set_to_dummy_if_null(ops, iomem_mapping);
>      set_to_dummy_if_null(ops, pci_config_permission);
> +    set_to_dummy_if_null(ops, get_vnumainfo);
>  
>  #if defined(HAS_PASSTHROUGH) && defined(HAS_PCI)
>      set_to_dummy_if_null(ops, get_device_group);
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index f2f59ea..00efba1 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -404,6 +404,11 @@ static int flask_claim_pages(struct domain *d)
>      return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SETCLAIM);
>  }
>  
> +static int flask_get_vnumainfo(struct domain *d)
> +{
> +    return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_VNUMAINFO);
> +}
> +
>  static int flask_console_io(struct domain *d, int cmd)
>  {
>      u32 perm;
> @@ -715,6 +720,9 @@ static int flask_domctl(struct domain *d, int cmd)
>      case XEN_DOMCTL_cacheflush:
>          return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__CACHEFLUSH);
>  
> +    case XEN_DOMCTL_setvnumainfo:
> +        return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN2__SET_VNUMAINFO);
> +
>      default:
>          printk("flask_domctl: Unknown op %d\n", cmd);
>          return -EPERM;
> @@ -1552,6 +1560,8 @@ static struct xsm_operations flask_ops = {
>      .hvm_param_nested = flask_hvm_param_nested,
>  
>      .do_xsm_op = do_flask_op,
> +    .get_vnumainfo = flask_get_vnumainfo,
> +
>  #ifdef CONFIG_COMPAT
>      .do_compat_op = compat_flask_op,
>  #endif
> diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
> index 32371a9..d279841 100644
> --- a/xen/xsm/flask/policy/access_vectors
> +++ b/xen/xsm/flask/policy/access_vectors
> @@ -200,6 +200,10 @@ class domain2
>      cacheflush
>  # Creation of the hardware domain when it is not dom0
>      create_hardware_domain
> +# XEN_DOMCTL_setvnumainfo
> +    set_vnumainfo
> +# XENMEM_getvnumainfo
> +    get_vnumainfo
>  }
>  
>  # Similar to class domain, but primarily contains domctls related to HVM domains
> -- 
> 1.7.10.4
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/10] xsm bits for vNUMA hypercalls
  2014-07-18 13:50   ` Konrad Rzeszutek Wilk
@ 2014-07-18 15:26     ` Daniel De Graaf
  2014-07-20 13:48       ` Elena Ufimtseva
  0 siblings, 1 reply; 63+ messages in thread
From: Daniel De Graaf @ 2014-07-18 15:26 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: lccycc123, keir, Ian.Campbell, george.dunlap, msw,
	dario.faggioli, stefano.stabellini, ian.jackson, xen-devel,
	JBeulich

On 07/18/2014 09:50 AM, Konrad Rzeszutek Wilk wrote:
> On Fri, Jul 18, 2014 at 01:50:01AM -0400, Elena Ufimtseva wrote:
>> Define xsm_get_vnumainfo hypercall used for domain which
>> wish to receive vnuma topology. Add xsm hook for
>> XEN_DOMCTL_setvnumainfo. Also adds basic policies.

In order to add basic policies, you should also modify the example
XSM policy in tools/flask/policy/policy/modules/xen/xen.{if,te} and add
the permission to the create_domain and/or manage_domain macros.
Otherwise, the commit message should only refer to adding XSM checks
(and the example policy will have to be modified later, so it's really
preferable to update them both now).

[...]
>> diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
>> index a85045d..c7ec562 100644
>> --- a/xen/include/xsm/xsm.h
>> +++ b/xen/include/xsm/xsm.h
>> @@ -169,6 +169,7 @@ struct xsm_operations {
>>       int (*unbind_pt_irq) (struct domain *d, struct xen_domctl_bind_pt_irq *bind);
>>       int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
>>       int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
>> +    int (*get_vnumainfo) (struct domain *d);
>>   #endif
>>   };
>>
>> @@ -653,6 +654,12 @@ static inline int xsm_ioport_mapping (xsm_default_t def, struct domain *d, uint3
>>   {
>>       return xsm_ops->ioport_mapping(d, s, e, allow);
>>   }
>> +
>> +static inline int xsm_get_vnumainfo (xsm_default_t def, struct domain *d)
>> +{
>> +    return xsm_ops->get_vnumainfo(d);
>> +}
>> +
>>   #endif /* CONFIG_X86 */

Both of these need to be moved outside of #ifdef CONFIG_X86 since the
hook is not x86-specific. The other files' changes look correct.
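In other words, a rough sketch of the intended placement in
xen/include/xsm/xsm.h (positions only, not an exact hunk):

  /* in struct xsm_operations, in the common (non-x86) section: */
  int (*get_vnumainfo) (struct domain *d);

  /* and the inline wrapper likewise outside the #ifdef CONFIG_X86 block: */
  static inline int xsm_get_vnumainfo (xsm_default_t def, struct domain *d)
  {
      return xsm_ops->get_vnumainfo(d);
  }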

-- 
Daniel De Graaf
National Security Agency

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 08/10] libxl: build numa nodes memory blocks
  2014-07-18 11:01   ` Wei Liu
@ 2014-07-20 12:58     ` Elena Ufimtseva
  2014-07-20 15:59       ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-20 12:58 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich

On Fri, Jul 18, 2014 at 7:01 AM, Wei Liu <wei.liu2@citrix.com> wrote:
>
> On Fri, Jul 18, 2014 at 01:50:07AM -0400, Elena Ufimtseva wrote:
> [...]
> >
> > +bool libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info);
> > +
> > +int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> > +                                  uint32_t *nr_entries,
> > +                                  unsigned long map_limitkb,
> > +                                  unsigned long balloon_kb);
> > +
>
Hi Wei,

Thanks for the comments.

>
> e820_sanitize should not take a ctx? It's internal function anyway.
> But this is not your fault so don't worry about it.
>
> And this function seems to be arch-specific so I wonder if there's
> better place for it.


I do feel the same way. Would libxl_arch.h be a better place?
>
>
> > +int libxl__vnuma_align_mem(libxl__gc *gc,
> > +                                     uint32_t domid,
> > +                                     struct libxl_domain_build_info *b_info,
> > +                                     vmemrange_t *memblks);
> > +
>
> Indentation.
>
> > +/*
> > + * For each node, build memory block start and end addresses.
> > + * Substract any memory hole from the range found in e820 map.
> > + * vnode memory size are passed here in megabytes, the result is
> > + * in memory block addresses.
> > + * Linux kernel will adjust numa memory block sizes on its own.
> > + * But we want to provide to the kernel numa block addresses that
> > + * will be the same in kernel and hypervisor.
> > + */
> > +#define max(a,b) ((a > b) ? a : b)
>
> You won't need to redefine max, I think we have this somewhere already.

I think there is a patch in libxc (from Andrew Cooper), "tools/libxc:
Shuffle definitions and uses of min()/max() macros".
Is that the one you are talking about?
>
>
> (Haven't looked into the placement strategy, I guess you, Dario and
> George have come to agreement on how this should be implemented)


Yes, will wait on Dario or George.
>
>
> Wei.




-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18 10:30   ` Wei Liu
@ 2014-07-20 13:16     ` Elena Ufimtseva
  2014-07-20 15:59       ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-20 13:16 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich

On Fri, Jul 18, 2014 at 6:30 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Jul 18, 2014 at 01:50:00AM -0400, Elena Ufimtseva wrote:
> [...]
>> +/*
>> + * Allocate memory and construct one vNUMA node,
>> + * set default parameters, assign all memory and
>> + * vcpus to this node, set distance to 10.
>> + */
>> +static long vnuma_fallback(const struct domain *d,
>> +                          struct vnuma_info **vnuma)
>> +{
>> +    struct vnuma_info *v;
>> +    long ret;
>> +
>> +
>> +    /* Will not destroy vNUMA here, destroy before calling this. */
>> +    if ( vnuma && *vnuma )
>> +        return -EINVAL;
>> +
>> +    v = *vnuma;
>> +    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    v->vmemrange[0].start = 0;
>> +    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;
>> +    v->vdistance[0] = 10;
>> +    v->vnode_to_pnode[0] = NUMA_NO_NODE;
>> +    memset(v->vcpu_to_vnode, 0, d->max_vcpus);
>> +    v->nr_vnodes = 1;
>> +
>> +    *vnuma = v;
>> +
>> +    return 0;
>> +}
>> +
>
> I have a question about this strategy. Is there any reason to choose to
> fall back to this one node? In that case the toolstack will have a
> different view of the guest than the hypervisor. The toolstack still thinks
> this guest has several nodes while the guest has only one. This can
> cause problems when migrating a guest. Consider this: the toolstack on the
> remote end still builds two nodes, given that is what it
> knows, then the guest, which originally had one node, notices the change in
> underlying memory topology and crashes.
>
> IMHO we should just fail in this case. It's not that common to fail a
> small array allocation anyway. This approach can also save you from
> writing this function. :-)

I see and agree :)

Do you mean fail as in not setting any vnuma for the domain? If yes, it sort of
contradicts the statement 'every PV domain has at least one vnuma node'.
Would it be reasonable, on a failed call to xc_domain_setvnuma from libxl,
to fall back to one node in the toolstack as well?
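Roughly, a hypothetical sketch of such a toolstack-side fallback (names
reused from this series purely for illustration; vmemrange would come
from libxl__vnuma_align_mem()):

  rc = libxl_domain_setvnuma(ctx, domid, b_info->nr_nodes,
                             b_info->max_vcpus, vmemrange,
                             b_info->vdistance, b_info->vnuma_vcpumap,
                             b_info->vnuma_vnodemap);
  if (rc) {
      /* Hypervisor rejected the topology: collapse libxl's view to a
       * single default vnode so both sides stay consistent, then retry
       * (or fail the domain build outright, as you suggest). */
      vnuma_zero_config(b_info);
  }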

>
>> +/*
>> + * construct vNUMA topology form u_vnuma struct and return
>> + * it in dst.
>> + */
> [...]
>> +
>> +    /* On failure, set only one vNUMA node and its success. */
>> +    ret = 0;
>> +
>> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
>> +        d->max_vcpus) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
>> +        nr_vnodes) )
>> +        goto vnuma_onenode;
>> +
>> +    v->nr_vnodes = nr_vnodes;
>> +    *dst = v;
>> +
>> +    return ret;
>> +
>> +vnuma_onenode:
>> +    vnuma_destroy(v);
>> +    return vnuma_fallback(d, dst);
>> +}
>> +
>>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>>  {
>>      long ret = 0;
>> @@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>>      }
>>      break;
>>
> [...]
>> +/*
>> + * vNUMA topology specifies vNUMA node number, distance table,
>> + * memory ranges and vcpu mapping provided for guests.
>> + * XENMEM_get_vnumainfo hypercall expects to see from guest
>> + * nr_vnodes and nr_vcpus to indicate available memory. After
>> + * filling guests structures, nr_vnodes and nr_vcpus copied
>> + * back to guest.
>> + */
>> +struct vnuma_topology_info {
>> +    /* IN */
>> +    domid_t domid;
>> +    /* IN/OUT */
>> +    unsigned int nr_vnodes;
>> +    unsigned int nr_vcpus;
>> +    /* OUT */
>> +    union {
>> +        XEN_GUEST_HANDLE(uint) h;
>> +        uint64_t pad;
>> +    } vdistance;
>> +    union {
>> +        XEN_GUEST_HANDLE(uint) h;
>> +        uint64_t pad;
>> +    } vcpu_to_vnode;
>> +    union {
>> +        XEN_GUEST_HANDLE(vmemrange_t) h;
>> +        uint64_t pad;
>> +    } vmemrange;
>
> Why do you need to use union? The other interface you introduce in this
> patch doesn't use union.

This one is for making sure the structures are the same size for
32-bit and 64-bit guests.
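(If it helps, the intent could also be asserted at build time -- a
sketch only, assuming the usual BUILD_BUG_ON is usable in that context:

  BUILD_BUG_ON(sizeof(((struct vnuma_topology_info *)0)->vdistance)     != 8);
  BUILD_BUG_ON(sizeof(((struct vnuma_topology_info *)0)->vcpu_to_vnode) != 8);
  BUILD_BUG_ON(sizeof(((struct vnuma_topology_info *)0)->vmemrange)     != 8);

i.e. each handle/pad union stays 8 bytes wide regardless of guest width.)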


>
> Wei.



-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18 13:49   ` Konrad Rzeszutek Wilk
@ 2014-07-20 13:26     ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-20 13:26 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich

On Fri, Jul 18, 2014 at 9:49 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Jul 18, 2014 at 01:50:00AM -0400, Elena Ufimtseva wrote:
>> Define interface, structures and hypercalls for toolstack to
>> build vnuma topology and for guests that wish to retrieve it.
>> Two subop hypercalls introduced by patch:
>> XEN_DOMCTL_setvnumainfo to define vNUMA domain topology per domain
>> and XENMEM_get_vnumainfo to retrieve that topology by guest.
>>
>> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
>> ---
>>  xen/common/domain.c         |   13 ++++
>>  xen/common/domctl.c         |  167 +++++++++++++++++++++++++++++++++++++++++++
>>  xen/common/memory.c         |   62 ++++++++++++++++
>>  xen/include/public/domctl.h |   29 ++++++++
>>  xen/include/public/memory.h |   47 +++++++++++-
>>  xen/include/xen/domain.h    |   11 +++
>>  xen/include/xen/sched.h     |    1 +
>>  7 files changed, 329 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index cd64aea..895584a 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -584,6 +584,18 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
>>      return 0;
>>  }
>>
>> +void vnuma_destroy(struct vnuma_info *vnuma)
>> +{
>> +    if ( vnuma )
>> +    {
>> +        xfree(vnuma->vmemrange);
>> +        xfree(vnuma->vcpu_to_vnode);
>> +        xfree(vnuma->vdistance);
>> +        xfree(vnuma->vnode_to_pnode);
>> +        xfree(vnuma);
>> +    }
>> +}
>> +
>>  int domain_kill(struct domain *d)
>>  {
>>      int rc = 0;
>> @@ -602,6 +614,7 @@ int domain_kill(struct domain *d)
>>          evtchn_destroy(d);
>>          gnttab_release_mappings(d);
>>          tmem_destroy(d->tmem_client);
>> +        vnuma_destroy(d->vnuma);
>>          domain_set_outstanding_pages(d, 0);
>>          d->tmem_client = NULL;
>>          /* fallthrough */
>> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
>> index c326aba..7464284 100644
>> --- a/xen/common/domctl.c
>> +++ b/xen/common/domctl.c
>> @@ -297,6 +297,144 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
>>              guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
>>  }
>>
>> +/*
>> + * Allocates memory for vNUMA, **vnuma should be NULL.
>> + * Caller has to make sure that domain has max_pages
>> + * and number of vcpus set for domain.
>> + * Verifies that single allocation does not exceed
>> + * PAGE_SIZE.
>> + */
>> +static int vnuma_alloc(struct vnuma_info **vnuma,
>> +                       unsigned int nr_vnodes,
>> +                       unsigned int nr_vcpus,
>> +                       unsigned int dist_size)
>> +{
>> +    struct vnuma_info *v;
>> +
>> +    if ( vnuma && *vnuma )
>> +        return -EINVAL;
>> +
>> +    v = *vnuma;
>> +    /*
>> +     * check if any of xmallocs exeeds PAGE_SIZE.
>> +     * If yes, consider it as an error for now.
>> +     */
>> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
>> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
>> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
>> +        dist_size > PAGE_SIZE / sizeof(dist_size) )
>> +        return -EINVAL;
>> +
>> +    v = xzalloc(struct vnuma_info);
>> +    if ( !v )
>> +        return -ENOMEM;
>> +
>> +    v->vdistance = xmalloc_array(unsigned int, dist_size);
>> +    v->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
>> +    v->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
>> +    v->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
>> +
>> +    if ( v->vdistance == NULL || v->vmemrange == NULL ||
>> +        v->vcpu_to_vnode == NULL || v->vnode_to_pnode == NULL )
>> +    {
>> +        vnuma_destroy(v);
>> +        return -ENOMEM;
>> +    }
>> +
>> +    *vnuma = v;
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Allocate memory and construct one vNUMA node,
>> + * set default parameters, assign all memory and
>> + * vcpus to this node, set distance to 10.
>> + */
>> +static long vnuma_fallback(const struct domain *d,
>> +                          struct vnuma_info **vnuma)
>> +{
>> +    struct vnuma_info *v;
>> +    long ret;
>> +
>> +
>> +    /* Will not destroy vNUMA here, destroy before calling this. */
>> +    if ( vnuma && *vnuma )
>> +        return -EINVAL;
>> +
>> +    v = *vnuma;
>> +    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    v->vmemrange[0].start = 0;
>> +    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;
>> +    v->vdistance[0] = 10;
>> +    v->vnode_to_pnode[0] = NUMA_NO_NODE;
>> +    memset(v->vcpu_to_vnode, 0, d->max_vcpus);
>> +    v->nr_vnodes = 1;
>> +
>> +    *vnuma = v;
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * construct vNUMA topology form u_vnuma struct and return
>> + * it in dst.
>> + */
>> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
>> +                const struct domain *d,
>> +                struct vnuma_info **dst)
>> +{
>> +    unsigned int dist_size, nr_vnodes = 0;
>> +    long ret;
>> +    struct vnuma_info *v = NULL;
>> +
>> +    ret = -EINVAL;
>> +
>> +    /* If vNUMA topology already set, just exit. */
>> +    if ( !u_vnuma || *dst )
>> +        return ret;
>> +
>> +    nr_vnodes = u_vnuma->nr_vnodes;
>> +
>> +    if ( nr_vnodes == 0 )
>> +        return ret;
>> +
>> +    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
>> +        return ret;
>> +
>> +    dist_size = nr_vnodes * nr_vnodes;
>> +
>> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    /* On failure, set only one vNUMA node and its success. */
>> +    ret = 0;
>> +
>> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
>> +        d->max_vcpus) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
>> +        nr_vnodes) )
>> +        goto vnuma_onenode;
>> +
>> +    v->nr_vnodes = nr_vnodes;
>> +    *dst = v;
>> +
>> +    return ret;
>> +
>> +vnuma_onenode:
>> +    vnuma_destroy(v);
>> +    return vnuma_fallback(d, dst);
>> +}
>> +
>>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>>  {
>>      long ret = 0;
>> @@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>>      }
>>      break;
>>
>> +    case XEN_DOMCTL_setvnumainfo:
>> +    {
>> +        struct vnuma_info *v = NULL;
>> +
>> +        ret = -EFAULT;
>> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
>> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
>> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
>> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
>> +            return ret;
>> +
>> +        ret = -EINVAL;
>> +
>> +        ret = vnuma_init(&op->u.vnuma, d, &v);
>> +        if ( ret < 0 || v == NULL )
>> +            break;
>> +
>> +        /* overwrite vnuma for domain */
>> +        if ( !d->vnuma )
>
> You want that in within the domain_lock.
>
> Otherwise an caller (on another CPU) could try to read the
> d->vnuma and blow up. Say by using the serial console and
> wanting to read the guest vNUMA topology.
>
>> +            vnuma_destroy(d->vnuma);
>> +
>> +        domain_lock(d);
>
> I would just do
>
>         vnuma_destroy(d->vnuma)
>
> here and remove the 'if' above.
>> +        d->vnuma = v;
>> +        domain_unlock(d);
>> +
>> +        ret = 0;
>> +    }
>> +    break;
>> +

Agreed and done :)

>>      default:
>>          ret = arch_do_domctl(op, d, u_domctl);
>>          break;
>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>> index c2dd31b..925b9fc 100644
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -969,6 +969,68 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>
>>          break;
>>
>> +    case XENMEM_get_vnumainfo:
>> +    {
>> +        struct vnuma_topology_info topology;
>> +        struct domain *d;
>> +        unsigned int dom_vnodes = 0;
>> +
>> +        /*
>> +         * guest passes nr_vnodes and nr_vcpus thus
>> +         * we know how much memory guest has allocated.
>> +         */
>> +        if ( copy_from_guest(&topology, arg, 1) ||
>> +            guest_handle_is_null(topology.vmemrange.h) ||
>> +            guest_handle_is_null(topology.vdistance.h) ||
>> +            guest_handle_is_null(topology.vcpu_to_vnode.h) )
>> +            return -EFAULT;
>> +
>> +        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
>> +            return -ESRCH;
>> +
>> +        rc = -EOPNOTSUPP;
>> +        if ( d->vnuma == NULL )
>> +            goto vnumainfo_out;
>> +
>> +        if ( d->vnuma->nr_vnodes == 0 )
>> +            goto vnumainfo_out;
>> +
>> +        dom_vnodes = d->vnuma->nr_vnodes;
>> +
>> +        /*
>> +         * guest nr_cpus and nr_nodes may differ from domain vnuma config.
>> +         * Check here guest nr_nodes and nr_cpus to make sure we dont overflow.
>> +         */
>> +        rc = -ENOBUFS;
>> +        if ( topology.nr_vnodes < dom_vnodes ||
>> +            topology.nr_vcpus < d->max_vcpus )
>> +            goto vnumainfo_out;
>> +
>> +        rc = -EFAULT;
>> +
>> +        if ( copy_to_guest(topology.vmemrange.h, d->vnuma->vmemrange,
>> +                           dom_vnodes) != 0 )
>> +            goto vnumainfo_out;
>> +
>> +        if ( copy_to_guest(topology.vdistance.h, d->vnuma->vdistance,
>> +                           dom_vnodes * dom_vnodes) != 0 )
>> +            goto vnumainfo_out;
>> +
>> +        if ( copy_to_guest(topology.vcpu_to_vnode.h, d->vnuma->vcpu_to_vnode,
>> +                           d->max_vcpus) != 0 )
>> +            goto vnumainfo_out;
>> +
>> +        topology.nr_vnodes = dom_vnodes;
>> +
>> +        if ( copy_to_guest(arg, &topology, 1) != 0 )
>> +            goto vnumainfo_out;
>> +        rc = 0;
>> +
>> + vnumainfo_out:
>> +        rcu_unlock_domain(d);
>> +        break;
>> +    }
>> +
>>      default:
>>          rc = arch_memory_op(cmd, arg);
>>          break;
>> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
>> index 5b11bbf..5ee74f4 100644
>> --- a/xen/include/public/domctl.h
>> +++ b/xen/include/public/domctl.h
>> @@ -35,6 +35,7 @@
>>  #include "xen.h"
>>  #include "grant_table.h"
>>  #include "hvm/save.h"
>> +#include "memory.h"
>>
>>  #define XEN_DOMCTL_INTERFACE_VERSION 0x0000000a
>>
>> @@ -934,6 +935,32 @@ struct xen_domctl_vcpu_msrs {
>>  };
>>  typedef struct xen_domctl_vcpu_msrs xen_domctl_vcpu_msrs_t;
>>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpu_msrs_t);
>> +
>> +/*
>> + * Use in XEN_DOMCTL_setvnumainfo to set
>> + * vNUMA domain topology.
>> + */
>> +struct xen_domctl_vnuma {
>> +    uint32_t nr_vnodes;
>> +    uint32_t _pad;
>> +    XEN_GUEST_HANDLE_64(uint) vdistance;
>> +    XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode;
>> +
>> +    /*
>> +     * vnodes to physical NUMA nodes mask.
>> +     * This kept on per-domain basis for
>> +     * interested consumers, such as numa aware ballooning.
>> +     */
>> +    XEN_GUEST_HANDLE_64(uint) vnode_to_pnode;
>> +
>> +    /*
>> +     * memory rages for each vNUMA node
>> +     */
>> +    XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange;
>> +};
>> +typedef struct xen_domctl_vnuma xen_domctl_vnuma_t;
>> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t);
>> +
>>  #endif
>>
>>  struct xen_domctl {
>> @@ -1008,6 +1035,7 @@ struct xen_domctl {
>>  #define XEN_DOMCTL_cacheflush                    71
>>  #define XEN_DOMCTL_get_vcpu_msrs                 72
>>  #define XEN_DOMCTL_set_vcpu_msrs                 73
>> +#define XEN_DOMCTL_setvnumainfo                  74
>>  #define XEN_DOMCTL_gdbsx_guestmemio            1000
>>  #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>>  #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
>> @@ -1068,6 +1096,7 @@ struct xen_domctl {
>>          struct xen_domctl_cacheflush        cacheflush;
>>          struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
>>          struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
>> +        struct xen_domctl_vnuma             vnuma;
>>          uint8_t                             pad[128];
>>      } u;
>>  };
>> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
>> index 2c57aa0..2c212e1 100644
>> --- a/xen/include/public/memory.h
>> +++ b/xen/include/public/memory.h
>> @@ -521,9 +521,54 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>>   * The zero value is appropiate.
>>   */
>>
>> +/* vNUMA node memory range */
>> +struct vmemrange {
>> +    uint64_t start, end;
>> +};
>> +
>> +typedef struct vmemrange vmemrange_t;
>> +DEFINE_XEN_GUEST_HANDLE(vmemrange_t);
>> +
>> +/*
>> + * vNUMA topology specifies vNUMA node number, distance table,
>> + * memory ranges and vcpu mapping provided for guests.
>> + * XENMEM_get_vnumainfo hypercall expects to see from guest
>> + * nr_vnodes and nr_vcpus to indicate available memory. After
>> + * filling guests structures, nr_vnodes and nr_vcpus copied
>> + * back to guest.
>> + */
>> +struct vnuma_topology_info {
>> +    /* IN */
>> +    domid_t domid;
>> +    /* IN/OUT */
>> +    unsigned int nr_vnodes;
>> +    unsigned int nr_vcpus;
>> +    /* OUT */
>> +    union {
>> +        XEN_GUEST_HANDLE(uint) h;
>> +        uint64_t pad;
>> +    } vdistance;
>> +    union {
>> +        XEN_GUEST_HANDLE(uint) h;
>> +        uint64_t pad;
>> +    } vcpu_to_vnode;
>> +    union {
>> +        XEN_GUEST_HANDLE(vmemrange_t) h;
>> +        uint64_t pad;
>> +    } vmemrange;
>> +};
>> +typedef struct vnuma_topology_info vnuma_topology_info_t;
>> +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
>> +
>> +/*
>> + * XENMEM_get_vnumainfo used by guest to get
>> + * vNUMA topology from hypervisor.
>> + */
>> +#define XENMEM_get_vnumainfo               26
>> +
>>  #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>>
>> -/* Next available subop number is 26 */
>> +/* Next available subop number is 27 */
>>
>>  #endif /* __XEN_PUBLIC_MEMORY_H__ */
>>
>> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
>> index bb1c398..d29a84d 100644
>> --- a/xen/include/xen/domain.h
>> +++ b/xen/include/xen/domain.h
>> @@ -89,4 +89,15 @@ extern unsigned int xen_processor_pmbits;
>>
>>  extern bool_t opt_dom0_vcpus_pin;
>>
>> +/* vnuma topology per domain. */
>> +struct vnuma_info {
>> +    unsigned int nr_vnodes;
>> +    unsigned int *vdistance;
>> +    unsigned int *vcpu_to_vnode;
>> +    unsigned int *vnode_to_pnode;
>> +    struct vmemrange *vmemrange;
>> +};
>> +
>> +void vnuma_destroy(struct vnuma_info *vnuma);
>> +
>>  #endif /* __XEN_DOMAIN_H__ */
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index d5bc461..71e4218 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -447,6 +447,7 @@ struct domain
>>      nodemask_t node_affinity;
>>      unsigned int last_alloc_node;
>>      spinlock_t node_affinity_lock;
>> +    struct vnuma_info *vnuma;
>>  };
>>
>>  struct domain_setup_info
>> --
>> 1.7.10.4
>>



-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/10] xsm bits for vNUMA hypercalls
  2014-07-18 15:26     ` Daniel De Graaf
@ 2014-07-20 13:48       ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-20 13:48 UTC (permalink / raw)
  To: Daniel De Graaf
  Cc: Li Yechen, Keir Fraser, Ian Campbell, George Dunlap, Matt Wilson,
	Dario Faggioli, Stefano Stabellini, Ian Jackson, xen-devel,
	Jan Beulich

On Fri, Jul 18, 2014 at 11:26 AM, Daniel De Graaf <dgdegra@tycho.nsa.gov> wrote:
> On 07/18/2014 09:50 AM, Konrad Rzeszutek Wilk wrote:
>>
>> On Fri, Jul 18, 2014 at 01:50:01AM -0400, Elena Ufimtseva wrote:
>>>
>>> Define xsm_get_vnumainfo hypercall used for domain which
>>> wish to receive vnuma topology. Add xsm hook for
>>> XEN_DOMCTL_setvnumainfo. Also adds basic policies.
>
>
> In order to add basic policies, you should also modify the example
> XSM policy in tools/flask/policy/policy/modules/xen/xen.{if,te} and add
> the permission to the create_domain and/or manage_domain macros.
> Otherwise, the commit message should only refer to adding XSM checks
> (and the example policy will have to be modified later, so it's really
> preferable to update them both now).
>
> [...]
>
>>> diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
>>> index a85045d..c7ec562 100644
>>> --- a/xen/include/xsm/xsm.h
>>> +++ b/xen/include/xsm/xsm.h
>>> @@ -169,6 +169,7 @@ struct xsm_operations {
>>>       int (*unbind_pt_irq) (struct domain *d, struct
>>> xen_domctl_bind_pt_irq *bind);
>>>       int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e,
>>> uint8_t allow);
>>>       int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e,
>>> uint8_t allow);
>>> +    int (*get_vnumainfo) (struct domain *d);
>>>   #endif
>>>   };
>>>
>>> @@ -653,6 +654,12 @@ static inline int xsm_ioport_mapping (xsm_default_t
>>> def, struct domain *d, uint3
>>>   {
>>>       return xsm_ops->ioport_mapping(d, s, e, allow);
>>>   }
>>> +
>>> +static inline int xsm_get_vnumainfo (xsm_default_t def, struct domain
>>> *d)
>>> +{
>>> +    return xsm_ops->get_vnumainfo(d);
>>> +}
>>> +
>>>   #endif /* CONFIG_X86 */
>
>
> Both of these need to be moved outside of #ifdef CONFIG_X86 since the
> hook is not x86-specific. The other files' changes look correct.
>
> --
> Daniel De Graaf
> National Security Agency

Thanks Daniel, done with this. It will appear in the next version.


-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc
  2014-07-18 10:53   ` Wei Liu
@ 2014-07-20 14:04     ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-20 14:04 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich

On Fri, Jul 18, 2014 at 6:53 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Jul 18, 2014 at 01:50:04AM -0400, Elena Ufimtseva wrote:
>> Parses vnuma topology number of nodes and memory
>> ranges. If not defined, initializes vnuma with
>> only one node and default topology. This one node covers
>> all domain memory and all vcpus assigned to it.
>>
>> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
>> ---
>>  docs/man/xl.cfg.pod.5       |   77 ++++++++
>>  tools/libxl/libxl_types.idl |    6 +-
>>  tools/libxl/libxl_vnuma.h   |    8 +
>>  tools/libxl/xl_cmdimpl.c    |  425 +++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 515 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/libxl/libxl_vnuma.h
>>
>> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
>> index ff9ea77..0c7fbf8 100644
>> --- a/docs/man/xl.cfg.pod.5
>> +++ b/docs/man/xl.cfg.pod.5
>> @@ -242,6 +242,83 @@ if the values of B<memory=> and B<maxmem=> differ.
>>  A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver
>>  it will crash.
>>
>> +=item B<vnuma_nodes=N>
>> +
>> +Number of vNUMA nodes the guest will be initialized with on boot.
>> +PV guest by default will have one vnuma node.
>> +
>
> In the future, all these config options will be used for HVM / PVH
> guests as well. But I'm fine with leaving it as is for the moment.

Sure!
>
> [...]
>>
>>  =head3 Event Actions
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index de25f42..5876822 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -318,7 +318,11 @@ libxl_domain_build_info = Struct("domain_build_info",[
>>      ("disable_migrate", libxl_defbool),
>>      ("cpuid",           libxl_cpuid_policy_list),
>>      ("blkdev_start",    string),
>> -
>> +    ("vnuma_mem",     Array(uint64, "nr_nodes")),
>> +    ("vnuma_vcpumap",     Array(uint32, "nr_nodemap")),
>> +    ("vdistance",        Array(uint32, "nr_dist")),
>> +    ("vnuma_vnodemap",  Array(uint32, "nr_node_to_pnode")),
>
> The main problem here is that we need to name counter variables
> num_VARs. See idl.txt, idl.Array section.

Ok, I will rename these.

>
>> +    ("vnuma_autoplacement",  libxl_defbool),
>>      ("device_model_version", libxl_device_model_version),
>>      ("device_model_stubdomain", libxl_defbool),
>>      # if you set device_model you must set device_model_version too
>> diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
>> new file mode 100644
>> index 0000000..4ff4c57
>> --- /dev/null
>> +++ b/tools/libxl/libxl_vnuma.h
>> @@ -0,0 +1,8 @@
>> +#include "libxl_osdeps.h" /* must come before any other headers */
>> +
>> +#define VNUMA_NO_NODE ~((unsigned int)0)
>> +
>> +/* Max vNUMA node size from Linux. */
>
> Should be "Min" I guess.
>
>> +#define MIN_VNODE_SIZE  32U
>> +
>> +#define MAX_VNUMA_NODES (unsigned int)1 << 10
>
> Does this also come from Linux? Or is it just some arbitrary number?
> Worth a comment here.

Yes, this comes from the maximum NODES_SHIFT in arch/x86/Kconfig. I will
add a comment on this in the code.
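Roughly, a sketch of what the commented define could look like in
libxl_vnuma.h (wording illustrative only):

  /*
   * 1 << 10 == 1024 is the maximum number of NUMA nodes a Linux x86
   * guest can be built with (NODES_SHIFT in arch/x86/Kconfig), so do
   * not accept more vnodes than the guest kernel could represent.
   */
  #define MAX_VNUMA_NODES (1U << 10)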

>
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index 68df548..5d91c2c 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -40,6 +40,7 @@
>>  #include "libxl_json.h"
>>  #include "libxlutil.h"
>>  #include "xl.h"
>> +#include "libxl_vnuma.h"
>>
>>  /* For calls which return an errno on failure */
>>  #define CHK_ERRNOVAL( call ) ({                                         \
>> @@ -690,6 +691,423 @@ static void parse_top_level_sdl_options(XLU_Config *config,
>>      xlu_cfg_replace_string (config, "xauthority", &sdl->xauthority, 0);
>>  }
>>
>> +
>> +static unsigned int get_list_item_uint(XLU_ConfigList *list, unsigned int i)
>> +{
>> +    const char *buf;
>> +    char *ep;
>> +    unsigned long ul;
>> +    int rc = -EINVAL;
>> +
>> +    buf = xlu_cfg_get_listitem(list, i);
>> +    if (!buf)
>> +        return rc;
>> +    ul = strtoul(buf, &ep, 10);
>> +    if (ep == buf)
>> +        return rc;
>> +    if (ul >= UINT16_MAX)
>> +        return rc;
>> +    return (unsigned int)ul;
>> +}
>> +
>> +static void vdistance_set(unsigned int *vdistance,
>> +                                unsigned int nr_vnodes,
>> +                                unsigned int samenode,
>> +                                unsigned int othernode)
>> +{
>> +    unsigned int idx, slot;
>> +    for (idx = 0; idx < nr_vnodes; idx++)
>> +        for (slot = 0; slot < nr_vnodes; slot++)
>> +            *(vdistance + slot * nr_vnodes + idx) =
>> +                idx == slot ? samenode : othernode;
>> +}
>> +
>> +static void vcputovnode_default(unsigned int *cpu_to_node,
>> +                                unsigned int nr_vnodes,
>> +                                unsigned int max_vcpus)
>> +{
>> +    unsigned int cpu;
>> +    for (cpu = 0; cpu < max_vcpus; cpu++)
>> +        cpu_to_node[cpu] = cpu % nr_vnodes;
>> +}
>> +
>> +/* Split domain memory between vNUMA nodes equally. */
>> +static int split_vnumamem(libxl_domain_build_info *b_info)
>> +{
>> +    unsigned long long vnodemem = 0;
>> +    unsigned long n;
>> +    unsigned int i;
>> +
>> +    if (b_info->nr_nodes == 0)
>> +        return -1;
>> +
>> +    vnodemem = (b_info->max_memkb >> 10) / b_info->nr_nodes;
>> +    if (vnodemem < MIN_VNODE_SIZE)
>> +        return -1;
>> +    /* reminder in MBytes. */
>> +    n = (b_info->max_memkb >> 10) % b_info->nr_nodes;
>> +    /* get final sizes in MBytes. */
>> +    for (i = 0; i < (b_info->nr_nodes - 1); i++)
>> +        b_info->vnuma_mem[i] = vnodemem;
>> +    /* add the reminder to the last node. */
>> +    b_info->vnuma_mem[i] = vnodemem + n;
>> +    return 0;
>> +}
>> +
>> +static void vnuma_vnodemap_default(unsigned int *vnuma_vnodemap,
>> +                                   unsigned int nr_vnodes)
>> +{
>> +    unsigned int i;
>> +    for (i = 0; i < nr_vnodes; i++)
>> +        vnuma_vnodemap[i] = VNUMA_NO_NODE;
>> +}
>> +
>> +/*
>> + * init vNUMA to "zero config" with one node and all other
>> + * topology parameters set to default.
>> + */
>> +static int vnuma_zero_config(libxl_domain_build_info *b_info)
>> +{
>
> Haven't looked into details of this function, but I think this should be
> renamed to vnuma_default_config, from the reading of comment.
>
>> +    b_info->nr_nodes = 1;
>> +    /* all memory goes to this one vnode, as well as vcpus. */
>> +    if (!(b_info->vnuma_mem = (uint64_t *)calloc(b_info->nr_nodes,
>> +                                sizeof(*b_info->vnuma_mem))))
>> +        goto bad_vnumazerocfg;
>> +
>> +    if (!(b_info->vnuma_vcpumap = (unsigned int *)calloc(b_info->max_vcpus,
>> +                                sizeof(*b_info->vnuma_vcpumap))))
>> +        goto bad_vnumazerocfg;
>> +
>> +    if (!(b_info->vdistance = (unsigned int *)calloc(b_info->nr_nodes *
>> +                                b_info->nr_nodes, sizeof(*b_info->vdistance))))
>> +        goto bad_vnumazerocfg;
>> +
>> +    if (!(b_info->vnuma_vnodemap = (unsigned int *)calloc(b_info->nr_nodes,
>> +                                sizeof(*b_info->vnuma_vnodemap))))
>> +        goto bad_vnumazerocfg;
>> +
>> +    b_info->vnuma_mem[0] = b_info->max_memkb >> 10;
>> +
>> +    /* all vcpus assigned to this vnode. */
>> +    vcputovnode_default(b_info->vnuma_vcpumap, b_info->nr_nodes,
>> +                        b_info->max_vcpus);
>> +
>> +    /* default vdistance is 10. */
>> +    vdistance_set(b_info->vdistance, b_info->nr_nodes, 10, 10);
>> +
>> +    /* VNUMA_NO_NODE for vnode_to_pnode. */
>> +    vnuma_vnodemap_default(b_info->vnuma_vnodemap, b_info->nr_nodes);
>> +
>> +    /*
>> +     * will be placed to some physical nodes defined by automatic
>> +     * numa placement or VNUMA_NO_NODE will not request exact node.
>> +     */
>> +    libxl_defbool_set(&b_info->vnuma_autoplacement, true);
>> +    return 0;
>> +
>> + bad_vnumazerocfg:
>> +    return -1;
>> +}
>> +
>> +static void free_vnuma_info(libxl_domain_build_info *b_info)
>> +{
>> +    free(b_info->vnuma_mem);
>> +    free(b_info->vdistance);
>> +    free(b_info->vnuma_vcpumap);
>> +    free(b_info->vnuma_vnodemap);
>> +    b_info->nr_nodes = 0;
>> +}
>> +
>> +static int parse_vnuma_mem(XLU_Config *config,
>> +                            libxl_domain_build_info **b_info)
>> +{
>> +    libxl_domain_build_info *dst;
>> +    XLU_ConfigList *vnumamemcfg;
>> +    int nr_vnuma_regions, i;
>> +    unsigned long long vnuma_memparsed = 0;
>> +    unsigned long ul;
>> +    const char *buf;
>> +
>> +    dst = *b_info;
>> +    if (!xlu_cfg_get_list(config, "vnuma_mem",
>> +                          &vnumamemcfg, &nr_vnuma_regions, 0)) {
>> +
>> +        if (nr_vnuma_regions != dst->nr_nodes) {
>> +            fprintf(stderr, "Number of numa regions (vnumamem = %d) is \
>> +                    incorrect (should be %d).\n", nr_vnuma_regions,
>> +                    dst->nr_nodes);
>> +            goto bad_vnuma_mem;
>> +        }
>> +
>> +        dst->vnuma_mem = calloc(dst->nr_nodes,
>> +                                 sizeof(*dst->vnuma_mem));
>> +        if (dst->vnuma_mem == NULL) {
>> +            fprintf(stderr, "Unable to allocate memory for vnuma ranges.\n");
>> +            goto bad_vnuma_mem;
>> +        }
>> +
>> +        char *ep;
>> +        /*
>> +         * Will parse only nr_vnodes times, even if we have more/less regions.
>> +         * Take care of it later if less or discard if too many regions.
>> +         */
>> +        for (i = 0; i < dst->nr_nodes; i++) {
>> +            buf = xlu_cfg_get_listitem(vnumamemcfg, i);
>> +            if (!buf) {
>> +                fprintf(stderr,
>> +                        "xl: Unable to get element %d in vnuma memory list.\n", i);
>> +                if (vnuma_zero_config(dst))
>> +                    goto bad_vnuma_mem;
>> +
>> +            }
>
> I think we should fail here instead of creating "default" config. See
> the reasoning I made on hypervisor side.

Yes, I see your point. As I replied there, we could just have both sides
drop to the default/zero-node config.

>
>> +            ul = strtoul(buf, &ep, 10);
>> +            if (ep == buf) {
>> +                fprintf(stderr, "xl: Invalid argument parsing vnumamem: %s.\n", buf);
>> +                if (vnuma_zero_config(dst))
>> +                    goto bad_vnuma_mem;
>> +            }
>> +
>
> Ditto.
>
>> +            /* 32Mb is a min size for a node, taken from Linux */
>> +            if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) {
>> +                fprintf(stderr, "xl: vnuma memory %lu is not within %u - %u range.\n",
>> +                        ul, MIN_VNODE_SIZE, UINT32_MAX);
>> +                if (vnuma_zero_config(dst))
>> +                    goto bad_vnuma_mem;
>> +            }
>> +
>
> Ditto.
>
> Wei.



-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18 11:48     ` Wei Liu
@ 2014-07-20 14:57       ` Elena Ufimtseva
  2014-07-22 15:49         ` Dario Faggioli
  2014-07-22 14:03       ` Dario Faggioli
  1 sibling, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-20 14:57 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 5603 bytes --]

On Fri, Jul 18, 2014 at 7:48 AM, Wei Liu <wei.liu2@citrix.com> wrote:

> On Fri, Jul 18, 2014 at 12:13:36PM +0200, Dario Faggioli wrote:
> > On ven, 2014-07-18 at 10:53 +0100, Wei Liu wrote:
> > > Hi! Another new series!
> > >
> > :-)
> >
> > > On Fri, Jul 18, 2014 at 01:49:59AM -0400, Elena Ufimtseva wrote:
> >
> > > > The workaround is to specify cpuid in config file and not use SMT.
> But soon I will come up
> > > > with some other acceptable solution.
> > > >
> > >
> > For Elena, workaround like what?
>

In the workaround I used (as we have HT/SMT turned on), I adjusted the guest's
cache and SMT related CPUID bits so that the vcpus do not appear as siblings
sharing caches.
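
A minimal sketch of that kind of xl config tweak (illustrative only; these
are not the exact settings I used):

    # hide the HTT feature bit so the guest does not treat its vcpus
    # as SMT siblings sharing a core
    cpuid = "host,htt=0"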



>  >
> > > I've also encountered this. I suspect that even if you disable SMT with
> > > cpuid in config file, the cpu topology in guest might still be wrong.
> > >
> > Can I ask why?
> >
>
> Because for a PV guest (currently) the guest kernel sees the real "ID"s
> for a cpu. See those "ID"s I change in my hacky patch.
>

Yep, that's what I see as well.

>
> > > What do hwloc-ls and lscpu show? Do you see any weird topology like one
> > > core belongs to one node while three belong to another?
> > >
> > Yep, that would be interesting to see.
> >
> > >  (I suspect not
> > > because your vcpus are already pinned to a specific node)
> > >
> > Sorry, I'm not sure I follow here... Are you saying that things probably
> > work ok, but that is (only) because of pinning?
>
> Yes, given that you derive numa memory allocation from cpu pinning or
> use combination of cpu pinning, vcpu to vnode map and vnode to pnode
> map, in those cases those IDs might reflect the right topology.
>
> >
> > I may be missing something here, but would it be possible to at least
> > try to make sure that the virtual topology and the topology related
> > content of CPUID actually agree? And I mean doing it automatically (if
>
> This is what I'm doing in my hack. :-)
>
> > only one of the two is specified) and to either error or warn if that is
> > not possible (if both are specified and they disagree)?
> >
> > I admit I'm not a CPUID expert, but I always thought this could be a
> > good solution...
> >
> > > What I did was to manipulate various "id"s in Linux kernel, so that I
> > > create a topology like 1 core : 1 cpu : 1 socket mapping.
> > >
> > And how this topology maps/interact with the virtual topology we want
> > the guest to have?
> >
>
> Say you have a two nodes guest, with 4 vcpus, you now have two sockets
> per node, each socket has one cpu, each cpu has one core.
>
> Node 0:
>   Socket 0:
>     CPU0:
>       Core 0
>   Socket 1:
>     CPU 1:
>       Core 1
> Node 1:
>   Socket 2:
>     CPU 2:
>       Core 2
>   Socket 3:
>     CPU 3:
>       Core 3
>
> > > In that case
> > > guest scheduler won't be able to make any assumption on individual CPU
> > > sharing caches with each other.
> > >
> > And, apart from SMT, what topology does the guest see then?
> >
>
> See above.
>
> > In any case, if this only alter SMT-ness (where "alter"="disable"), I
> > think that is fine too. What I'm failing at seeing is whether and why
> > this approach is more powerful than manipulating CPUID from config file.
> >
> > I'm insisting because, if they'd be equivalent, in terms of results, I
> > think it's easier, cleaner and more correct to deal with CPUID in xl and
> > libxl (automatically or semi-automatically).
> >
>
> SMT is just one aspect of the story that easily surfaces.
>
> In my opinion, if we don't manually create some kind of topology for the
> guest, the guest might end up with something weird. For example, if you
> have a 2 nodes, 4 sockets, 8 cpus, 8 cores system, you might have
>
> Node 0:
>   Socket 0
>     CPU0
>   Socket 1
>     CPU1
> Node 1:
>   Socket 2
>     CPU 3
>     CPU 4
>
> which all stems from guest having knowledge of real CPU "ID"s.
>
> And this topology is just wrong; it might only be accurate at guest
> creation time. Xen is free to schedule vcpus to different pcpus, so the guest
> scheduler will make wrong decisions based on erroneous information.
>
> That's why I chose to have 1 core : 1 cpu : 1 socket mapping, so that
> guest makes no assumption on cache sharing etc. It's suboptimal but
> should provide predictable average performance. What do you think?
>

Running lstopo with vNUMA enabled in guest with 4 vnodes, 8 vcpus:
root@heatpipe:~# lstopo

Machine (7806MB) + L3 L#0 (7806MB 10MB) + L2 L#0 (7806MB 256KB) + L1d L#0
(7806MB 32KB) + L1i L#0 (7806MB 32KB)
  NUMANode L#0 (P#0 1933MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#4)
  NUMANode L#1 (P#1 1967MB) + Socket L#1
    Core L#2 + PU L#2 (P#1)
    Core L#3 + PU L#3 (P#5)
  NUMANode L#2 (P#2 1969MB) + Socket L#2
    Core L#4 + PU L#4 (P#2)
    Core L#5 + PU L#5 (P#6)
  NUMANode L#3 (P#3 1936MB) + Socket L#3
    Core L#6 + PU L#6 (P#3)
    Core L#7 + PU L#7 (P#7)

Basically, L2 and L1 are shared between nodes :)

I have manipulated cache-sharing options in cpuid before, but I agree with
Wei that it's just a part of the problem.
Along with the number of logical processors (if HT is enabled), I guess
we need to construct APIC IDs (if that's not done yet; I could not find it),
and the cache-sharing CPUID leaves may also be needed, taking pinning into
account if it is set.

Like its described here:
https://software.intel.com/en-us/articles/methods-to-utilize-intels-hyper-threading-technology-with-linux

"The Initial APIC ID is composed of the physical processor's ID and the
logical processor's ID within the physical processor. The least significant
bits of the APIC ID are used to identify the logical processors within a
single physical processor."
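
To illustrate how such an ID could be composed (the bit widths below are only
an example; real hardware derives them from the CPUID topology leaves):

    /* Illustrative sketch only. */
    #define SMT_BITS  1   /* bits for the thread (SMT) ID        */
    #define CORE_BITS 2   /* bits for the core ID within a package */

    static unsigned int make_apic_id(unsigned int package_id,
                                     unsigned int core_id,
                                     unsigned int thread_id)
    {
        return (package_id << (CORE_BITS + SMT_BITS)) |
               (core_id << SMT_BITS) | thread_id;
    }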




>
> Wei.
>



-- 
Elena

[-- Attachment #1.2: Type: text/html, Size: 8169 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 08/10] libxl: build numa nodes memory blocks
  2014-07-20 12:58     ` Elena Ufimtseva
@ 2014-07-20 15:59       ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2014-07-20 15:59 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich, Wei Liu

On Sun, Jul 20, 2014 at 08:58:14AM -0400, Elena Ufimtseva wrote:
> On Fri, Jul 18, 2014 at 7:01 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> >
> > On Fri, Jul 18, 2014 at 01:50:07AM -0400, Elena Ufimtseva wrote:
> > [...]
> > >
> > > +bool libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info);
> > > +
> > > +int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> > > +                                  uint32_t *nr_entries,
> > > +                                  unsigned long map_limitkb,
> > > +                                  unsigned long balloon_kb);
> > > +
> >
> Hi Wei
> 
> Thanks for comments.
> 
> >
> > e820_sanitize should not take a ctx? It's internal function anyway.
> > But this is not your fault so don't worry about it.
> >
> > And this function seems to be arch-specific so I wonder if there's
> > better place for it.
> 
> 
> I do feel the same way. Would libxl_arch.h be a better place?

I think this one is for Ian and Ian.

> >
> >
> > > +int libxl__vnuma_align_mem(libxl__gc *gc,
> > > +                                     uint32_t domid,
> > > +                                     struct libxl_domain_build_info *b_info,
> > > +                                     vmemrange_t *memblks);
> > > +
> >
> > Indentation.
> >
> > > +/*
> > > + * For each node, build memory block start and end addresses.
> > > + * Substract any memory hole from the range found in e820 map.
> > > + * vnode memory size are passed here in megabytes, the result is
> > > + * in memory block addresses.
> > > + * Linux kernel will adjust numa memory block sizes on its own.
> > > + * But we want to provide to the kernel numa block addresses that
> > > + * will be the same in kernel and hypervisor.
> > > + */
> > > +#define max(a,b) ((a > b) ? a : b)
> >
> > You won't need to redefine max, I think we have this somewhere already.
> 
> I think there is a patch in libxc (from Andrew Cooper) "tools/libxc:
> Shuffle definitions and uses of min()/max() macros"
> Is this the one you are talking about?

I think so.

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-20 13:16     ` Elena Ufimtseva
@ 2014-07-20 15:59       ` Wei Liu
  2014-07-22 15:18         ` Dario Faggioli
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-07-20 15:59 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich, Wei Liu

On Sun, Jul 20, 2014 at 09:16:11AM -0400, Elena Ufimtseva wrote:
[...]
> >> +
> >
> > I have question about this strategy. Is there any reason to choose to
> > fallback to this one node? In that case the toolstack will have
> > different view of the guest than the hypervisor. Toolstack still thinks
> > this guest has several nodes while this guest has only one. The can
> > cause problem when migrating a guest. Consider this, toolstack on the
> > remote end still builds two nodes given the fact that it's what it
> > knows, then the guest originally has one node notices the change in
> > underlying memory topology and crashes.
> >
> > IMHO we should just fail in this case. It's not that common to fail a
> > small array allocation anyway. This approach can also save you from
> > writing this function. :-)
> 
> I see and agree )
> 
> Do you mean fail as in not setting any vnuma for the domain? If yes, it sort
> of contradicts the statement 'every pv domain has at least one vnuma node'.
> Would it be reasonable, on a failed xc_domain_setvnuma call from libxl,
> to fall back to one node in the toolstack as well?
> 

I mean:
1. we should setup one node if user doesn't specify one (the default
   case)
2. we should fail if user specifies vnuma config but we fail to allocate
   relevant structures, instead of falling back to one node setup,
   because this leads to discrepancy between hypervisor and toolstack

By "fail" I mean returning error code from hypervisor to toolstack
saying that domain creation fails.

#1 is consistent with what we already have. It's #2 here that I
proposed.
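
In (hypothetical) code the split would look roughly like this; the helper
names are made up:

    /* Sketch only, not actual code. */
    if ( !vnuma_config_present(cfg) )
        /* case 1: no user config, build the default single node */
        rc = vnuma_build_one_node(d);
    else
    {
        /* case 2: user supplied a config; any failure fails domain creation */
        rc = vnuma_build_from_config(d, cfg);
        if ( rc )
            return rc;   /* no silent fallback to one node */
    }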

> >
> >> +/*
> >> + * construct vNUMA topology form u_vnuma struct and return
> >> + * it in dst.
> >> + */
> > [...]
> >> +
[...]
> >> +struct vnuma_topology_info {
> >> +    /* IN */
> >> +    domid_t domid;
> >> +    /* IN/OUT */
> >> +    unsigned int nr_vnodes;
> >> +    unsigned int nr_vcpus;
> >> +    /* OUT */
> >> +    union {
> >> +        XEN_GUEST_HANDLE(uint) h;
> >> +        uint64_t pad;
> >> +    } vdistance;
> >> +    union {
> >> +        XEN_GUEST_HANDLE(uint) h;
> >> +        uint64_t pad;
> >> +    } vcpu_to_vnode;
> >> +    union {
> >> +        XEN_GUEST_HANDLE(vmemrange_t) h;
> >> +        uint64_t pad;
> >> +    } vmemrange;
> >
> > Why do you need to use union? The other interface you introduce in this
> > patch doesn't use union.
> 
> This one is for making sure the structures are the same size on 32 and
> 64 bits.
> 

I can see other similar occurrences of XEN_GUEST_HANDLE that don't need
padding. Did I miss something?

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
                   ` (11 preceding siblings ...)
  2014-07-18  9:53 ` Wei Liu
@ 2014-07-22 12:49 ` Dario Faggioli
  2014-07-23  5:59   ` Elena Ufimtseva
  12 siblings, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 12:49 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Wei Liu


[-- Attachment #1.1: Type: text/plain, Size: 4170 bytes --]

On ven, 2014-07-18 at 01:49 -0400, Elena Ufimtseva wrote:
> vNUMA introduction
>
Hey Elena!

Thanks for this series, and in particular for this clear and complete
cover letter.

> This series of patches introduces vNUMA topology awareness and
> provides interfaces and data structures to enable vNUMA for
> PV guests. There is a plan to extend this support for dom0 and
> HVM domains.
> 
> vNUMA topology support should be supported by PV guest kernel.
> Corresponding patches should be applied.
> 
> Introduction
> -------------
> 
> vNUMA topology is exposed to the PV guest to improve performance when running
> workloads on NUMA machines. vNUMA enabled guests may be running on non-NUMA
> machines and thus having virtual NUMA topology visible to guests.
> XEN vNUMA implementation provides a way to run vNUMA-enabled guests on NUMA/UMA
> and flexibly map vNUMA topology to physical NUMA topology.
> 
> Mapping to physical NUMA topology may be done in manual and automatic way.
> By default, every PV domain has one vNUMA node. It is populated by default
> parameters and does not affect performance. To use automatic way of initializing
> vNUMA topology, configuration file need only to have number of vNUMA nodes
> defined. Not-defined vNUMA topology parameters will be initialized to default
> ones.
> 
> vNUMA topology is currently defined as a set of parameters such as:
>     number of vNUMA nodes;
>     distance table;
>     vnodes memory sizes;
>     vcpus to vnodes mapping;
>     vnode to pnode map (for NUMA machines).
> 
I'd include a brief explanation of what each parameter means and does.

>     XEN_DOMCTL_setvnumainfo is used by toolstack to populate domain
> vNUMA topology with user defined configuration or the parameters by default.
> vNUMA is defined for every PV domain and if no vNUMA configuration found,
> one vNUMA node is initialized and all cpus are assigned to it. All other
> parameters set to their default values.
> 
>     XENMEM_gevnumainfo is used by the PV domain to get the information
> from hypervisor about vNUMA topology. Guest sends its memory sizes allocated
> for different vNUMA parameters and hypervisor fills it with topology.
> Future work to use this in HVM guests in the toolstack is required and
> in the hypervisor to allow HVM guests to use these hypercalls.
> 
> libxl
> 
> libxl allows us to define vNUMA topology in configuration file and verifies that
> configuration is correct. libxl also verifies mapping of vnodes to pnodes and
> uses it in case of NUMA-machine and if automatic placement was disabled. In case
> of incorrect/insufficient configuration, one vNUMA node will be initialized
> and populated with default values.
> 
Well, about automatic placement, I think we don't need to disable vNUMA
if it's enabled. In fact, automatic placement will try to place the
domain on one node only, and yes, if it manages to do so, no point
enabling vNUMA (unless the user asked for it, as you're saying). OTOH,
if automatic placement puts the domain on 2 or more nodes (e.g., because
the domain is 4G, and there is only 3G free on each node), then I think
vNUMA should chime in, and provide the guest with an appropriate,
internally built, NUMA topology.

> libxc
> 
> libxc builds the vnodes memory addresses for guest and makes necessary
> alignments to the addresses. It also takes into account guest e820 memory map
> configuration. The domain memory is allocated and vnode to pnode mapping
> is used to determine target node for particular vnode. If this mapping was not
> defined, it is not a NUMA machine or automatic NUMA placement is enabled, the
> default not node-specific allocation will be used.
> 
Ditto. However, automatic placement does not do much at the libxc level
right now, and I think that should continue to be the case.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-18 11:48     ` Wei Liu
  2014-07-20 14:57       ` Elena Ufimtseva
@ 2014-07-22 14:03       ` Dario Faggioli
  2014-07-22 14:48         ` Wei Liu
  2014-07-22 19:43         ` Is: cpuid creation of PV guests is not correct. Was:Re: " Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 14:03 UTC (permalink / raw)
  To: Wei Liu
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Elena Ufimtseva


[-- Attachment #1.1: Type: text/plain, Size: 6828 bytes --]

On ven, 2014-07-18 at 12:48 +0100, Wei Liu wrote:
> On Fri, Jul 18, 2014 at 12:13:36PM +0200, Dario Faggioli wrote:
> > On ven, 2014-07-18 at 10:53 +0100, Wei Liu wrote:

> > > I've also encountered this. I suspect that even if you disable SMT with
> > > cpuid in config file, the cpu topology in guest might still be wrong.
> > >
> > Can I ask why?
> > 
> 
> Because for a PV guest (currently) the guest kernel sees the real "ID"s
> for a cpu. See those "ID"s I change in my hacky patch.
> 
Right, now I see/remember it. Well, this is, I think, something we
should try to fix _independently_ from vNUMA, isn't it?

I mean, even right now, PV guests see completely random cache-sharing
topology, and that does (at least potentially) affect performance, as
the guest scheduler will make incorrect/inconsistent assumptions.

I'm not sure what the correct fix is. Probably something similar to what
you're doing in your hack... but, indeed, I think we should do something
about this!

> > > What do hwloc-ls and lscpu show? Do you see any weird topology like one
> > > core belongs to one node while three belong to another?
> > >
> > Yep, that would be interesting to see.
> > 
> > >  (I suspect not
> > > because your vcpus are already pinned to a specific node)
> > > 
> > Sorry, I'm not sure I follow here... Are you saying that things probably
> > work ok, but that is (only) because of pinning?
> 
> Yes, given that you derive numa memory allocation from cpu pinning or
> use combination of cpu pinning, vcpu to vnode map and vnode to pnode
> map, in those cases those IDs might reflect the right topology.
> 
Well, pinning does (should?) not always happen, as a consequence of a
virtual topology being used.

So, again, I don't think we should rely on pinning to have a sane and,
more important, consistent SMT and cache sharing topology.

Linux maintainers, any ideas?


BTW, I tried a few examples, on the following host:

root@benny:~# xl info -n
...
nr_cpus                : 8
max_cpu_id             : 15
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 3591
...
cpu_topology           :
cpu:    core    socket     node
  0:       0        0        0
  1:       0        0        0
  2:       1        0        0
  3:       1        0        0
  4:       2        0        0
  5:       2        0        0
  6:       3        0        0
  7:       3        0        0
numa_info              :
node:    memsize    memfree    distances
   0:     34062      31029      10

With the following guest configuration, in terms of vcpu pinning:

1) 2 vCPUs ==> same pCPUs
root@benny:~# xl vcpu-list 
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
debian.guest.osstest                 9     0    0   -b-       2.7  0
debian.guest.osstest                 9     1    0   -b-       5.2  0
debian.guest.osstest                 9     2    7   -b-       2.4  7
debian.guest.osstest                 9     3    7   -b-       4.4  7

2) no SMT
root@benny:~# xl vcpu-list 
Name                                ID  VCPU   CPU State   Time(s) CPU
Affinity
debian.guest.osstest                11     0    0   -b-       0.6  0
debian.guest.osstest                11     1    2   -b-       0.4  2
debian.guest.osstest                11     2    4   -b-       1.5  4
debian.guest.osstest                11     3    6   -b-       0.5  6

3) Random
root@benny:~# xl vcpu-list 
Name                                ID  VCPU   CPU State   Time(s) CPU
Affinity
debian.guest.osstest                12     0    3   -b-       1.6  all
debian.guest.osstest                12     1    1   -b-       1.4  all
debian.guest.osstest                12     2    5   -b-       2.4  all
debian.guest.osstest                12     3    7   -b-       1.5  all

4) yes SMT
root@benny:~# xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU
Affinity
debian.guest.osstest                14     0    1   -b-       1.0  1
debian.guest.osstest                14     1    2   -b-       1.8  2
debian.guest.osstest                14     2    6   -b-       1.1  6
debian.guest.osstest                14     3    7   -b-       0.8  7

And, in *all* these 4 cases, here's what I see:

root@debian:~# cat /sys/devices/system/cpu/cpu*/topology/core_siblings_list
0-3
0-3
0-3
0-3

root@debian:~# cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
0-3
0-3
0-3
0-3

root@debian:~# lstopo
Machine (488MB) + Socket L#0 + L3 L#0 (8192KB) + L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#1)
  PU L#2 (P#2)
  PU L#3 (P#3)

root@debian:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    4
Core(s) per socket:    1
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Stepping:              3
CPU MHz:               3591.780
BogoMIPS:              7183.56
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

I.e., no matter how I pin the vcpus, the guest sees the 4 vcpus as if
they were all SMT siblings, within the same core, sharing all cache
levels.

This is not the case for dom0 where (I booted with dom0_max_vcpus=4 on
the xen command line) I see this:

root@benny:~# lstopo
Machine (422MB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)

root@benny:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Stepping:              3
CPU MHz:               3591.780
BogoMIPS:              7183.56
Hypervisor vendor:     Xen
Virtualization type:   none
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

What am I doing wrong, or what am I missing?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 14:03       ` Dario Faggioli
@ 2014-07-22 14:48         ` Wei Liu
  2014-07-22 15:06           ` Dario Faggioli
  2014-07-22 19:43         ` Is: cpuid creation of PV guests is not correct. Was:Re: " Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-07-22 14:48 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Elena Ufimtseva,
	Wei Liu

On Tue, Jul 22, 2014 at 04:03:44PM +0200, Dario Faggioli wrote:
> On ven, 2014-07-18 at 12:48 +0100, Wei Liu wrote:
> > On Fri, Jul 18, 2014 at 12:13:36PM +0200, Dario Faggioli wrote:
> > > On ven, 2014-07-18 at 10:53 +0100, Wei Liu wrote:
> 
> > > > I've also encountered this. I suspect that even if you disable SMT with
> > > > cpuid in config file, the cpu topology in guest might still be wrong.
> > > >
> > > Can I ask why?
> > > 
> > 
> > Because for a PV guest (currently) the guest kernel sees the real "ID"s
> > for a cpu. See those "ID"s I change in my hacky patch.
> > 
> Right, now I see/remember it. Well, this is, I think, something we
> should try to fix _independently_ from vNUMA, isn't it?
> 
> I mean, even right now, PV guests see completely random cache-sharing
> topology, and that does (at least potentially) affect performance, as
> the guest scheduler will make incorrect/inconsistent assumptions.
> 

Correct. It's just that it might be more obvious to see the problem with
vNUMA.

> I'm not sure what the correct fix is. Probably something similar to what
> you're doing in your hack... but, indeed, I think we should do something
> about this!
> 
> > > > What do hwloc-ls and lscpu show? Do you see any weird topology like one
> > > > core belongs to one node while three belong to another?
> > > >
> > > Yep, that would be interesting to see.
> > > 
> > > >  (I suspect not
> > > > because your vcpus are already pinned to a specific node)
> > > > 
> > > Sorry, I'm not sure I follow here... Are you saying that things probably
> > > work ok, but that is (only) because of pinning?
> > 
> > Yes, given that you derive numa memory allocation from cpu pinning or
> > use combination of cpu pinning, vcpu to vnode map and vnode to pnode
> > map, in those cases those IDs might reflect the right topology.
> > 
> Well, pinning does (should?) not always happen, as a consequence of a
> virtual topology being used.
> 

That's true. I was just referring to the current status of the patch
series. AIUI that's how it is implemented now, not necessarily the way it
has to be.

> So, again, I don't think we should rely on pinning to have a sane and,
> more important, consistent SMT and cache sharing topology.
> 
> Linux maintainers, any ideas?
> 
> 
> BTW, I tried a few examples, on the following host:
> 
> root@benny:~# xl info -n
> ...
> nr_cpus                : 8
> max_cpu_id             : 15
> nr_nodes               : 1
> cores_per_socket       : 4
> threads_per_core       : 2
> cpu_mhz                : 3591
> ...
> cpu_topology           :
> cpu:    core    socket     node
>   0:       0        0        0
>   1:       0        0        0
>   2:       1        0        0
>   3:       1        0        0
>   4:       2        0        0
>   5:       2        0        0
>   6:       3        0        0
>   7:       3        0        0
> numa_info              :
> node:    memsize    memfree    distances
>    0:     34062      31029      10
> 
> With the following guest configuration, in terms of vcpu pinning:
> 
> 1) 2 vCPUs ==> same pCPUs

4 vcpus, I think.

> root@benny:~# xl vcpu-list 
> Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
> debian.guest.osstest                 9     0    0   -b-       2.7  0
> debian.guest.osstest                 9     1    0   -b-       5.2  0
> debian.guest.osstest                 9     2    7   -b-       2.4  7
> debian.guest.osstest                 9     3    7   -b-       4.4  7
> 
> 2) no SMT
> root@benny:~# xl vcpu-list 
> Name                                ID  VCPU   CPU State   Time(s) CPU
> Affinity
> debian.guest.osstest                11     0    0   -b-       0.6  0
> debian.guest.osstest                11     1    2   -b-       0.4  2
> debian.guest.osstest                11     2    4   -b-       1.5  4
> debian.guest.osstest                11     3    6   -b-       0.5  6
> 
> 3) Random
> root@benny:~# xl vcpu-list 
> Name                                ID  VCPU   CPU State   Time(s) CPU
> Affinity
> debian.guest.osstest                12     0    3   -b-       1.6  all
> debian.guest.osstest                12     1    1   -b-       1.4  all
> debian.guest.osstest                12     2    5   -b-       2.4  all
> debian.guest.osstest                12     3    7   -b-       1.5  all
> 
> 4) yes SMT
> root@benny:~# xl vcpu-list
> Name                                ID  VCPU   CPU State   Time(s) CPU
> Affinity
> debian.guest.osstest                14     0    1   -b-       1.0  1
> debian.guest.osstest                14     1    2   -b-       1.8  2
> debian.guest.osstest                14     2    6   -b-       1.1  6
> debian.guest.osstest                14     3    7   -b-       0.8  7
> 
> And, in *all* these 4 cases, here's what I see:
> 
> root@debian:~# cat /sys/devices/system/cpu/cpu*/topology/core_siblings_list
> 0-3
> 0-3
> 0-3
> 0-3
> 
> root@debian:~# cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
> 0-3
> 0-3
> 0-3
> 0-3
> 
> root@debian:~# lstopo
> Machine (488MB) + Socket L#0 + L3 L#0 (8192KB) + L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>   PU L#0 (P#0)
>   PU L#1 (P#1)
>   PU L#2 (P#2)
>   PU L#3 (P#3)
> 

I won't be surprised if the guest builds up a wrong topology, as what real
"ID"s it sees depends very much on what pcpus you pick.

Have you tried pinning vcpus to pcpus [0, 1, 2, 3]? That way you should
be able to see the same topology as the one you saw in Dom0?

> root@debian:~# lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    4
> Core(s) per socket:    1
> Socket(s):             1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 60
> Stepping:              3
> CPU MHz:               3591.780
> BogoMIPS:              7183.56
> Hypervisor vendor:     Xen
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              8192K
> 
> I.e., no matter how I pin the vcpus, the guest sees the 4 vcpus as if
> they were all SMT siblings, within the same core, sharing all cache
> levels.
> 
> This is not the case for dom0 where (I booted with dom0_max_vcpus=4 on
> the xen command line) I see this:
> 

I guess this is because you're basically picking pcpu 0-3 for Dom0. It
doesn't matter if you pin them or not.

Wei.

> root@benny:~# lstopo
> Machine (422MB)
>   Socket L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#1)
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>       PU L#2 (P#2)
>       PU L#3 (P#3)
> 
> root@benny:~# lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 60
> Stepping:              3
> CPU MHz:               3591.780
> BogoMIPS:              7183.56
> Hypervisor vendor:     Xen
> Virtualization type:   none
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              8192K
> 
> What am I doing wrong, or what am I missing?
> 
> Thanks and Regards,
> Dario
> 
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 14:48         ` Wei Liu
@ 2014-07-22 15:06           ` Dario Faggioli
  2014-07-22 16:47             ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 15:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Elena Ufimtseva


[-- Attachment #1.1: Type: text/plain, Size: 6228 bytes --]

On mar, 2014-07-22 at 15:48 +0100, Wei Liu wrote:
> On Tue, Jul 22, 2014 at 04:03:44PM +0200, Dario Faggioli wrote:

> > I mean, even right now, PV guests see completely random cache-sharing
> > topology, and that does (at least potentially) affect performance, as
> > the guest scheduler will make incorrect/inconsistent assumptions.
> > 
> 
> Correct. It's just that it might be more obvious to see the problem with
> vNUMA.
> 
Yep.

> > > Yes, given that you derive numa memory allocation from cpu pinning or
> > > use combination of cpu pinning, vcpu to vnode map and vnode to pnode
> > > map, in those cases those IDs might reflect the right topology.
> > > 
> > Well, pinning does (should?) not always happen, as a consequence of a
> > virtual topology being used.
> > 
> 
> That's true. I was just referring to the current status of the patch
> series. AIUI that's how it is implemented now, not necessary the way it
> has to be.
> 
Ok.

> > With the following guest configuration, in terms of vcpu pinning:
> > 
> > 1) 2 vCPUs ==> same pCPUs
> 
> 4 vcpus, I think.
> 
> > root@benny:~# xl vcpu-list 
> > Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
> > debian.guest.osstest                 9     0    0   -b-       2.7  0
> > debian.guest.osstest                 9     1    0   -b-       5.2  0
> > debian.guest.osstest                 9     2    7   -b-       2.4  7
> > debian.guest.osstest                 9     3    7   -b-       4.4  7
> > 
What I meant by "2 vCPUs" was that I was putting 2 vCPUs of the guest
(0 and 1) on the same pCPU (0), and the other 2 (2 and 3) on another
(7).

That should have produced a guest topology where the two pairs of vcpus do
not share at least the lower cache levels, but that is not what I see.

> > 2) no SMT
> > root@benny:~# xl vcpu-list 
> > Name                                ID  VCPU   CPU State   Time(s) CPU
> > Affinity
> > debian.guest.osstest                11     0    0   -b-       0.6  0
> > debian.guest.osstest                11     1    2   -b-       0.4  2
> > debian.guest.osstest                11     2    4   -b-       1.5  4
> > debian.guest.osstest                11     3    6   -b-       0.5  6
> > 
> > 3) Random
> > root@benny:~# xl vcpu-list 
> > Name                                ID  VCPU   CPU State   Time(s) CPU
> > Affinity
> > debian.guest.osstest                12     0    3   -b-       1.6  all
> > debian.guest.osstest                12     1    1   -b-       1.4  all
> > debian.guest.osstest                12     2    5   -b-       2.4  all
> > debian.guest.osstest                12     3    7   -b-       1.5  all
> > 
> > 4) yes SMT
> > root@benny:~# xl vcpu-list
> > Name                                ID  VCPU   CPU State   Time(s) CPU
> > Affinity
> > debian.guest.osstest                14     0    1   -b-       1.0  1
> > debian.guest.osstest                14     1    2   -b-       1.8  2
> > debian.guest.osstest                14     2    6   -b-       1.1  6
> > debian.guest.osstest                14     3    7   -b-       0.8  7
> > 
> > And, in *all* these 4 cases, here's what I see:
> > 
> > root@debian:~# cat /sys/devices/system/cpu/cpu*/topology/core_siblings_list
> > 0-3
> > 0-3
> > 0-3
> > 0-3
> > 
> > root@debian:~# cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
> > 0-3
> > 0-3
> > 0-3
> > 0-3
> > 
> > root@debian:~# lstopo
> > Machine (488MB) + Socket L#0 + L3 L#0 (8192KB) + L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
> >   PU L#0 (P#0)
> >   PU L#1 (P#1)
> >   PU L#2 (P#2)
> >   PU L#3 (P#3)
> > 
> 
> I won't be surprised if guest builds up a wrong topology, as what real
> "ID"s it sees depends very much on what pcpus you pick.
> 
Exactly, but if I pin all the guest vCPUs on specific host pCPUs from
the very beginning (pinning specified in the config file, which is what
I'm doing), I should be able to control that...

> Have you tried pinning vcpus to pcpus [0, 1, 2, 3]? That way you should
> be able to see the same topology as the one you saw in Dom0?
> 
Well, at least some of the examples above should have shown some
non-shared cache levels already. Anyway, here it comes:

root@benny:~# xl vcpu-list 
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
debian.guest.osstest                15     0    0   -b-       1.8  0
debian.guest.osstest                15     1    1   -b-       0.7  1
debian.guest.osstest                15     2    2   -b-       0.6  2
debian.guest.osstest                15     3    3   -b-       0.7  3

root@debian:~# hwloc-ls --of console
Machine (488MB) + Socket L#0 + L3 L#0 (8192KB) + L2 L#0 (256KB) + L1 L#0
(32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#1)
  PU L#2 (P#2)
  PU L#3 (P#3)

root@debian:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    4
Core(s) per socket:    1
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Stepping:              3
CPU MHz:               3591.780
BogoMIPS:              7183.56
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

So, no, that is not giving the same result as in Dom0. :-(

> > This is not the case for dom0 where (I booted with dom0_max_vcpus=4 on
> > the xen command line) I see this:
> > 
> 
> I guess this is because you're basically picking pcpu 0-3 for Dom0. It
> doesn't matter if you pin them or not.
> 
That makes total sense, and in fact, I was not surprised about Dom0
looking like this... What I am rather surprised about is not being able to get a similar
topology for the guest, no matter how I pin it... :-/

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
  2014-07-18 10:30   ` Wei Liu
  2014-07-18 13:49   ` Konrad Rzeszutek Wilk
@ 2014-07-22 15:14   ` Dario Faggioli
  2014-07-23  5:22     ` Elena Ufimtseva
  2014-07-23 14:06   ` Jan Beulich
  3 siblings, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 15:14 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich


[-- Attachment #1.1: Type: text/plain, Size: 5944 bytes --]

On ven, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index cd64aea..895584a 100644

> @@ -297,6 +297,144 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
>              guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
>  }
>  
> +/*
> + * Allocates memory for vNUMA, **vnuma should be NULL.
> + * Caller has to make sure that domain has max_pages
> + * and number of vcpus set for domain.
> + * Verifies that single allocation does not exceed
> + * PAGE_SIZE.
> + */
> +static int vnuma_alloc(struct vnuma_info **vnuma,
> +                       unsigned int nr_vnodes,
> +                       unsigned int nr_vcpus,
> +                       unsigned int dist_size)
> +{
> +    struct vnuma_info *v;
> +
> +    if ( vnuma && *vnuma )
> +        return -EINVAL;
> +
> +    v = *vnuma;
>
Do you need this? What for?

> +    /*
> +     * check if any of xmallocs exeeds PAGE_SIZE.
> +     * If yes, consider it as an error for now.
>
Do you mind elaborating a bit more on the 'for now'? Why 'for now'?
What's the plan for the future, etc. ...

> +     */
> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
> +        dist_size > PAGE_SIZE / sizeof(dist_size) )
> +        return -EINVAL;
> +
> +    v = xzalloc(struct vnuma_info);
> +    if ( !v )
> +        return -ENOMEM;
> +
> +    v->vdistance = xmalloc_array(unsigned int, dist_size);
> +    v->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
> +    v->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
> +    v->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
> +
> +    if ( v->vdistance == NULL || v->vmemrange == NULL ||
> +        v->vcpu_to_vnode == NULL || v->vnode_to_pnode == NULL )
> +    {
> +        vnuma_destroy(v);
> +        return -ENOMEM;
> +    }
> +
> +    *vnuma = v;
> +
> +    return 0;
> +}
> +
> +/*
> + * Allocate memory and construct one vNUMA node,
> + * set default parameters, assign all memory and
> + * vcpus to this node, set distance to 10.
> + */
> +static long vnuma_fallback(const struct domain *d,
> +                          struct vnuma_info **vnuma)
> +{
> + 
I think I agree with Wei, about this fallback not being necessary.

> +/*
> + * construct vNUMA topology form u_vnuma struct and return
> + * it in dst.
> + */
> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
> +                const struct domain *d,
> +                struct vnuma_info **dst)
> +{
> +    unsigned int dist_size, nr_vnodes = 0;
> +    long ret;
> +    struct vnuma_info *v = NULL;
> +
> +    ret = -EINVAL;
> +
Why not initialize 'ret' while defining it?

> +    /* If vNUMA topology already set, just exit. */
> +    if ( !u_vnuma || *dst )
> +        return ret;
> +
> +    nr_vnodes = u_vnuma->nr_vnodes;
> +
> +    if ( nr_vnodes == 0 )
> +        return ret;
> +
> +    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
> +        return ret;
> +
Mmmm, do we perhaps want to #define a maximum number of supported virtual
nodes, put it somewhere in a header, and use it for the check? I mean
something like what we have for the host (in that case, it's called
MAX_NUMNODES).

I mean, with UINT_MAX being 2^32 - 1, this check still allows up to 2^16
nodes; would it make sense to allow a guest with that many nodes?
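
Something like this, say (just a sketch; the constant's name and value are
made up):

    /* Illustrative only: bound the number of virtual nodes explicitly,
     * rather than relying solely on the overflow check. */
    #define MAX_GUEST_VNODES 64

    if ( nr_vnodes == 0 || nr_vnodes > MAX_GUEST_VNODES )
        return ret;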

> +    dist_size = nr_vnodes * nr_vnodes;
> +
> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
> +    if ( ret )
> +        return ret;
> +
> +    /* On failure, set only one vNUMA node and its success. */
> +    ret = 0;
> +
> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
> +        d->max_vcpus) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
> +        nr_vnodes) )
> +        goto vnuma_onenode;
> +
> +    v->nr_vnodes = nr_vnodes;
> +    *dst = v;
> +
> +    return ret;
> +
> +vnuma_onenode:
> +    vnuma_destroy(v);
> +    return vnuma_fallback(d, dst);
>
As said, just report the error and bail in this case.

> +}
> +
>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  {
>      long ret = 0;
> @@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>      }
>      break;
>  
> +    case XEN_DOMCTL_setvnumainfo:
> +    {
> +        struct vnuma_info *v = NULL;
> +
> +        ret = -EFAULT;
> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
> +            return ret;
> +
> +        ret = -EINVAL;
> +
> +        ret = vnuma_init(&op->u.vnuma, d, &v);
>
Rather pointless 'ret=-EINVAL', I would say. :-)

> +        if ( ret < 0 || v == NULL )
> +            break;
> +
> +        /* overwrite vnuma for domain */
> +        if ( !d->vnuma )
> +            vnuma_destroy(d->vnuma);
> +
> +        domain_lock(d);
> +        d->vnuma = v;
> +        domain_unlock(d);
> +
> +        ret = 0;
> +    }
> +    break;
> +
>      default:
>          ret = arch_do_domctl(op, d, u_domctl);
>          break;

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-20 15:59       ` Wei Liu
@ 2014-07-22 15:18         ` Dario Faggioli
  2014-07-23  5:33           ` Elena Ufimtseva
  0 siblings, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 15:18 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Elena Ufimtseva


[-- Attachment #1.1: Type: text/plain, Size: 1624 bytes --]

On dom, 2014-07-20 at 16:59 +0100, Wei Liu wrote:
> On Sun, Jul 20, 2014 at 09:16:11AM -0400, Elena Ufimtseva wrote:

> > >> +struct vnuma_topology_info {
> > >> +    /* IN */
> > >> +    domid_t domid;
> > >> +    /* IN/OUT */
> > >> +    unsigned int nr_vnodes;
> > >> +    unsigned int nr_vcpus;
> > >> +    /* OUT */
> > >> +    union {
> > >> +        XEN_GUEST_HANDLE(uint) h;
> > >> +        uint64_t pad;
> > >> +    } vdistance;
> > >> +    union {
> > >> +        XEN_GUEST_HANDLE(uint) h;
> > >> +        uint64_t pad;
> > >> +    } vcpu_to_vnode;
> > >> +    union {
> > >> +        XEN_GUEST_HANDLE(vmemrange_t) h;
> > >> +        uint64_t pad;
> > >> +    } vmemrange;
> > >
> > > Why do you need to use union? The other interface you introduce in this
> > > patch doesn't use union.
> > 
> > This is one is for making sure on 32 and 64 bits the structures are of
> > the same size.
> > 
> 
> I can see other similar occurrences of XEN_GUEST_HANDLE that don't need
> padding. Did I miss something?
> 
I remember this coming up during review of an earlier version of Elena's
series, and I think I also remember the union with padding solution
being suggested, but I don't remember which round it was, and who
suggested it... Elena, up for some digging in your inbox (or xen-devel
archives)? :-P

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-20 14:57       ` Elena Ufimtseva
@ 2014-07-22 15:49         ` Dario Faggioli
  0 siblings, 0 replies; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 15:49 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Wei Liu



On dom, 2014-07-20 at 10:57 -0400, Elena Ufimtseva wrote:


> Running lstopo with vNUMA enabled in guest with 4 vnodes, 8 vcpus:
> root@heatpipe:~# lstopo
> 
> 
> Machine (7806MB) + L3 L#0 (7806MB 10MB) + L2 L#0 (7806MB 256KB) + L1d
> L#0 (7806MB 32KB) + L1i L#0 (7806MB 32KB)
>   NUMANode L#0 (P#0 1933MB) + Socket L#0
>     Core L#0 + PU L#0 (P#0)
>     Core L#1 + PU L#1 (P#4)
>   NUMANode L#1 (P#1 1967MB) + Socket L#1
>     Core L#2 + PU L#2 (P#1)
>     Core L#3 + PU L#3 (P#5)
>   NUMANode L#2 (P#2 1969MB) + Socket L#2
>     Core L#4 + PU L#4 (P#2)
>     Core L#5 + PU L#5 (P#6)
>   NUMANode L#3 (P#3 1936MB) + Socket L#3
>     Core L#6 + PU L#6 (P#3)
>     Core L#7 + PU L#7 (P#7)
> 
> 
> Basically, L2 and L1 are shared between nodes :)
> 
> 
> I have manipulated cache sharing options in cpuid before, but I agree
> with Wei that it's just a part of the problem.
>
It is indeed.

> Along with the number of logical processors (if HT is enabled), I
> guess we need to construct APIC IDs (if that is not done yet, I could not
> find it), and the cache-sharing CPUID leaves may be needed, taking into
> account pinning if it is set.
> 
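
For reference, the APIC ID construction alluded to above amounts to packing
thread/core/socket indices into bit fields. Below is a generic sketch, not
existing Xen or toolstack code; the helper name and the use of fls() here are
placeholders:

    /* Derive an APIC ID from a (socket, core, thread) triple, assuming
     * threads_per_core and cores_per_socket are rounded up to powers of two.
     */
    static unsigned int make_apic_id(unsigned int socket, unsigned int core,
                                     unsigned int thread,
                                     unsigned int threads_per_core,
                                     unsigned int cores_per_socket)
    {
        unsigned int thread_bits = fls(threads_per_core - 1); /* bits for SMT */
        unsigned int core_bits = fls(cores_per_socket - 1);   /* bits for cores */

        return (socket << (core_bits + thread_bits)) |
               (core << thread_bits) | thread;
    }
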
Well, I'm not sure. Thing is, this is a general issue, and we need to
find a general way to solve it, where with "general" I mean not
necessarily vNUMA related.

Once we have that, we can see how to take care of vNUMA.

I'm not sure I want to rely on pinning that much, as pinning can change
and, if it does, we'd be back to square one, playing tricks on the
guest's in-kernel scheduler.

Let's see what others think...

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 15:06           ` Dario Faggioli
@ 2014-07-22 16:47             ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2014-07-22 16:47 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Elena Ufimtseva,
	Wei Liu

On Tue, Jul 22, 2014 at 05:06:37PM +0200, Dario Faggioli wrote:
[...]
> 
> > > With the following guest configuration, in terms of vcpu pinning:
> > > 
> > > 1) 2 vCPUs ==> same pCPUs
> > 
> > 4 vcpus, I think.
> > 
> > > root@benny:~# xl vcpu-list 
> > > Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
> > > debian.guest.osstest                 9     0    0   -b-       2.7  0
> > > debian.guest.osstest                 9     1    0   -b-       5.2  0
> > > debian.guest.osstest                 9     2    7   -b-       2.4  7
> > > debian.guest.osstest                 9     3    7   -b-       4.4  7
> > > 
> What I meant with "2 vCPUs" was that I was putting 2 vCPUs of the guest
> (0 and 1) on the same pCPU (0), and the other 2 (2 and 3) on another
> (7).
> 
> That should have meant a topology where the two groups do not share at
> least the last-level cache in the guest, but that is not what I see.
> 

I see.

> > > 2) no SMT
> > > root@benny:~# xl vcpu-list 
> > > Name                                ID  VCPU   CPU State   Time(s) CPU
[...]
> So, no, that is not giving the same result as in Dom0. :-(
> 
> > > This is not the case for dom0 where (I booted with dom0_max_vcpus=4 on
> > > the xen command line) I see this:
> > > 
> > 
> > I guess this is because you're basically picking pcpu 0-3 for Dom0. It
> > doesn't matter if you pin them or not.
> > 
> That makes total sense, and in fact, I was not surprised about Dom0
> looking like this... I rather am about not being able to get a similar
> topology for the guest, no matter how I pin it... :-/
> 

I guess it might also be related to the different CPUID policy Dom0 and
DomU have. Though in theory Dom0 and DomU are both PV domains, they see
different CPUIDs, I think.

I only studied the code to the point where I was able to manipulate the
topology as I saw fit, so I could be wrong...

Wei

> Dario
> 
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 


* Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 14:03       ` Dario Faggioli
  2014-07-22 14:48         ` Wei Liu
@ 2014-07-22 19:43         ` Konrad Rzeszutek Wilk
  2014-07-22 22:34           ` Is: cpuid creation of PV guests is not correct Andrew Cooper
  2014-07-22 22:53           ` Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction Dario Faggioli
  1 sibling, 2 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-07-22 19:43 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Wei Liu,
	Elena Ufimtseva

> I.e., no matter how I pin the vcpus, the guest sees the 4 vcpus as if
> they were all SMT siblings, within the same core, sharing all cache
> levels.

My recollection was that the setting of these CPUID values is
tied in how the toolstack sees it - and since the toolstack
runs in the initial domain - that is where it picks this data up.

This problem had been discussed by Andrew Cooper at some point
(Hackathon? Emails? IRC?) and moved under the 'fix cpuid creation/parsing'.

I think that this issue should not affect Elena's patchset - 
as the vNUMA is an innocent bystander that gets affected by this.

As such changing the title.
> 
> This is not the case for dom0 where (I booted with dom0_max_vcpus=4 on
> the xen command line) I see this:
> 
> root@benny:~# lstopo
> Machine (422MB)
>   Socket L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#1)
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>       PU L#2 (P#2)
>       PU L#3 (P#3)
> 
> root@benny:~# lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 60
> Stepping:              3
> CPU MHz:               3591.780
> BogoMIPS:              7183.56
> Hypervisor vendor:     Xen
> Virtualization type:   none
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              8192K
> 
> What am I doing wrong, or what am I missing?
> 
> Thanks and Regards,
> Dario
> 
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 




* Re: Is: cpuid creation of PV guests is not correct.
  2014-07-22 19:43         ` Is: cpuid creation of PV guests is not correct. Was:Re: " Konrad Rzeszutek Wilk
@ 2014-07-22 22:34           ` Andrew Cooper
  2014-07-22 22:53           ` Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction Dario Faggioli
  1 sibling, 0 replies; 63+ messages in thread
From: Andrew Cooper @ 2014-07-22 22:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Dario Faggioli
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	stefano.stabellini, ian.jackson, xen-devel, JBeulich, Wei Liu,
	Elena Ufimtseva

On 22/07/2014 20:43, Konrad Rzeszutek Wilk wrote:
>> I.e., no matter how I pin the vcpus, the guest sees the 4 vcpus as if
>> they were all SMT siblings, within the same core, sharing all cache
>> levels.
> My recollection was that the setting of these CPUID values is
> tied in how the toolstack sees it - and since the toolstack
> runs in the initial domain - that is where it picks this data up.
>
> This problem had been discussed by Andrew Cooper at some point
> (Hackathon? Emails? IRC?) and moved under the 'fix cpuid creation/parsing'.
>
> I think that this issue should not affect Elena's patchset - 
> as the vNUMA is an innocent bystander that gets affected by this.
>
> As such changing the title.

There are a whole set of related issues with regard to cpuid under Xen
currently.  I investigated the problems from the point of view of
heterogeneous host feature levelling.  I do plan to work on these issues
(as feature levelling is an important use case for XenServer) and will do
so when the migration v2 work is complete.

However, to summarise the issues:

Xen's notion of a domain's cpuid policy was adequate for single-vcpu VMs,
but was never updated when multi-vcpu VMs were introduced.  There is no
concept of per-vcpu information in the policy, which is why all the
cache IDs you read are identical.

The policy is theoretically controlled exclusively from the toolstack. 
The toolstack has the responsibility of setting the contents of any
leaves it believes the guest might be interested in, and Xen stores
these values wholesale.  If a cpuid query is requested of a domain which
lacks an entry for that specific leaf, the information is retrieved by
running a cpuid instruction, which is not necessarily deterministic.

The toolstack, under the cpuid policy of the domain it is running in,
attempts to guess the featureset to be offered to a domain, with
possible influence from user-specified domain configuration.  Xen
doesn't validate the featureset when the policy is set.  Instead, there
is veto/sanity code used on all accesses to the policy.  As a result,
the cpuid values as seen by the guest are not necessarily the same as
the values successfully set by the toolstack.

The various IDs which are obtained from cpuid inside a domain will
happen to be the IDs available to libxc when it was building the policy
for the domain.  For a regular PV dom0, it will be the IDs available on
the pcpu (or several, given rescheduling) on which libxc was running.

Xen can completely control the values returned by the cpuid instruction
from HVM/PVM domains.  On the other hand, results for PV guests are
strictly opt-in via use of the Xen forced-emulation prefix.  As a
result, well behaved PV kernels will see the policy, but regular
userspace applications in PV guests will see the native cpuid results.
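
For illustration, the opt-in looks roughly like this in a PV kernel (the
prefix encoding below follows what the Linux PV code uses, to the best of my
recollection; sketch only):

    /* ud2a followed by "xen" asks Xen to emulate the next instruction, so
     * this cpuid goes through the hypervisor and sees the domain's policy;
     * a plain cpuid from PV userspace executes natively instead.
     */
    #define XEN_EMULATE_PREFIX ".byte 0x0f,0x0b,0x78,0x65,0x6e ; "

    static inline void xen_cpuid(uint32_t *eax, uint32_t *ebx,
                                 uint32_t *ecx, uint32_t *edx)
    {
        asm volatile ( XEN_EMULATE_PREFIX "cpuid"
                       : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                       : "0" (*eax), "2" (*ecx) );
    }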

There are two caveats.  Intel Ivy-bridge (and later) hardware has
support for cpuid faulting which allows Xen to regain exactly the same
level of control over PV guests as it has for HVM guests.  There are
also cpuid masking (Intel)/override (AMD) MSRs (which vary in
availability between processor generations) which allow the visible
featureset of any cpuid instruction to be altered.


I have some vague plans for how to fix these issues, which I will need
to see about designing sensibly in due course.  However, a brief
overview is something like this:

* Ownership of the entire domain policy resides with Xen rather than the
toolstack, and when a domain is created, it shall inherit from the host
setup, given appropriate per-domain type restrictions.
* The toolstack may query and modify a domain's policy, with verification
of the modifications before they are accepted.

~Andrew


* Re: Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 19:43         ` Is: cpuid creation of PV guests is not correct. Was:Re: " Konrad Rzeszutek Wilk
  2014-07-22 22:34           ` Is: cpuid creation of PV guests is not correct Andrew Cooper
@ 2014-07-22 22:53           ` Dario Faggioli
  2014-07-23  6:00             ` Elena Ufimtseva
  1 sibling, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-07-22 22:53 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	lccycc123, ian.jackson, xen-devel, JBeulich, Wei Liu,
	Elena Ufimtseva



On Tue, 2014-07-22 at 15:43 -0400, Konrad Rzeszutek Wilk wrote:
> > I.e., no matter how I pin the vcpus, the guest sees the 4 vcpus as if
> > they were all SMT siblings, within the same core, sharing all cache
> > levels.
> 
> My recollection was that the setting of these CPUID values is
> tied in how the toolstack sees it - and since the toolstack
> runs in the initial domain - that is where it picks this data up.
> 
> This problem had been discussed by Andrew Cooper at some point
> (Hackathon? Emails? IRC?) and moved under the 'fix cpuid creation/parsing'.
> 
> I think that this issue should not affect Elena's patchset - 
> as the vNUMA is an innocent bystander that gets affected by this.
> 
Agreed. Let's live with the WARN for now or, if that is annoying (e.g.
for testing/benchmarking) with whatever (local!) hack one needs.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-22 15:14   ` Dario Faggioli
@ 2014-07-23  5:22     ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-23  5:22 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich

On Tue, Jul 22, 2014 at 11:14 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On ven, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
>
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index cd64aea..895584a 100644
>
>> @@ -297,6 +297,144 @@ int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
>>              guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
>>  }
>>
>> +/*
>> + * Allocates memory for vNUMA, **vnuma should be NULL.
>> + * Caller has to make sure that domain has max_pages
>> + * and number of vcpus set for domain.
>> + * Verifies that single allocation does not exceed
>> + * PAGE_SIZE.
>> + */
>> +static int vnuma_alloc(struct vnuma_info **vnuma,
>> +                       unsigned int nr_vnodes,
>> +                       unsigned int nr_vcpus,
>> +                       unsigned int dist_size)
>> +{
>> +    struct vnuma_info *v;
>> +
>> +    if ( vnuma && *vnuma )
>> +        return -EINVAL;
>> +
>> +    v = *vnuma;
>>
> Do you need this? What for?
>
>> +    /*
>> +     * check if any of xmallocs exeeds PAGE_SIZE.
>> +     * If yes, consider it as an error for now.
>>
> Do you mind elaborating a bit more on the 'for now'? Why 'for now'?
> What's the plan for the future, etc. ...
>
>> +     */
>> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
>> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
>> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
>> +        dist_size > PAGE_SIZE / sizeof(dist_size) )
>> +        return -EINVAL;
>> +
>> +    v = xzalloc(struct vnuma_info);
>> +    if ( !v )
>> +        return -ENOMEM;
>> +
>> +    v->vdistance = xmalloc_array(unsigned int, dist_size);
>> +    v->vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
>> +    v->vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus);
>> +    v->vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
>> +
>> +    if ( v->vdistance == NULL || v->vmemrange == NULL ||
>> +        v->vcpu_to_vnode == NULL || v->vnode_to_pnode == NULL )
>> +    {
>> +        vnuma_destroy(v);
>> +        return -ENOMEM;
>> +    }
>> +
>> +    *vnuma = v;
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Allocate memory and construct one vNUMA node,
>> + * set default parameters, assign all memory and
>> + * vcpus to this node, set distance to 10.
>> + */
>> +static long vnuma_fallback(const struct domain *d,
>> +                          struct vnuma_info **vnuma)
>> +{
>> +
> I think I agree with Wei, about this fallback not being necessary.
>
>> +/*
>> + * construct vNUMA topology form u_vnuma struct and return
>> + * it in dst.
>> + */
>> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
>> +                const struct domain *d,
>> +                struct vnuma_info **dst)
>> +{
>> +    unsigned int dist_size, nr_vnodes = 0;
>> +    long ret;
>> +    struct vnuma_info *v = NULL;
>> +
>> +    ret = -EINVAL;
>> +
> Why not initialize 'ret' while defining it?
>
>> +    /* If vNUMA topology already set, just exit. */
>> +    if ( !u_vnuma || *dst )
>> +        return ret;
>> +
>> +    nr_vnodes = u_vnuma->nr_vnodes;
>> +
>> +    if ( nr_vnodes == 0 )
>> +        return ret;
>> +
>> +    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
>> +        return ret;
>> +
> Mmmm, do we perhaps want to #define a maximum number of supported virtual
> nodes, put it somewhere in a header, and use it for the check? I mean
> something like what we have for the host (in that case, it's called
> MAX_NUMNODES).
>
> I mean, if UINT_MAX is 2^64, would it make sense to allow a 2^32 nodes
> guest?

True to that, no one needs that many nodes :) Will define a constant in
v7. Probably it will make sense to set it to the same value as the number
of vcpus?
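
Something along these lines would do; the name and bound below are
placeholders rather than a proposal for the final interface:

    /* Illustrative cap on the number of virtual NUMA nodes per guest. */
    #define XEN_MAX_VNODES 64U

    if ( nr_vnodes == 0 ||
         nr_vnodes > min(d->max_vcpus, XEN_MAX_VNODES) )
        return -EINVAL;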

>
>> +    dist_size = nr_vnodes * nr_vnodes;
>> +
>> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    /* On failure, set only one vNUMA node and its success. */
>> +    ret = 0;
>> +
>> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
>> +        d->max_vcpus) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
>> +        nr_vnodes) )
>> +        goto vnuma_onenode;
>> +
>> +    v->nr_vnodes = nr_vnodes;
>> +    *dst = v;
>> +
>> +    return ret;
>> +
>> +vnuma_onenode:
>> +    vnuma_destroy(v);
>> +    return vnuma_fallback(d, dst);
>>
> As said, just report the error and bail in this case.

Yes, agree on that.

>
>> +}
>> +
>>  long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>>  {
>>      long ret = 0;
>> @@ -967,6 +1105,35 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>>      }
>>      break;
>>
>> +    case XEN_DOMCTL_setvnumainfo:
>> +    {
>> +        struct vnuma_info *v = NULL;
>> +
>> +        ret = -EFAULT;
>> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
>> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
>> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
>> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
>> +            return ret;
>> +
>> +        ret = -EINVAL;
>> +
>> +        ret = vnuma_init(&op->u.vnuma, d, &v);
>>
> Rather pointless 'ret=-EINVAL', I would say. :-)

>
>> +        if ( ret < 0 || v == NULL )
>> +            break;
>> +
>> +        /* overwrite vnuma for domain */
>> +        if ( !d->vnuma )
>> +            vnuma_destroy(d->vnuma);
>> +
>> +        domain_lock(d);
>> +        d->vnuma = v;
>> +        domain_unlock(d);
>> +
>> +        ret = 0;
>> +    }
>> +    break;
>> +
>>      default:
>>          ret = arch_do_domctl(op, d, u_domctl);
>>          break;
>
> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Elena


* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-22 15:18         ` Dario Faggioli
@ 2014-07-23  5:33           ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-23  5:33 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Wei Liu

On Tue, Jul 22, 2014 at 11:18 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On dom, 2014-07-20 at 16:59 +0100, Wei Liu wrote:
>> On Sun, Jul 20, 2014 at 09:16:11AM -0400, Elena Ufimtseva wrote:
>
>> > >> +struct vnuma_topology_info {
>> > >> +    /* IN */
>> > >> +    domid_t domid;
>> > >> +    /* IN/OUT */
>> > >> +    unsigned int nr_vnodes;
>> > >> +    unsigned int nr_vcpus;
>> > >> +    /* OUT */
>> > >> +    union {
>> > >> +        XEN_GUEST_HANDLE(uint) h;
>> > >> +        uint64_t pad;
>> > >> +    } vdistance;
>> > >> +    union {
>> > >> +        XEN_GUEST_HANDLE(uint) h;
>> > >> +        uint64_t pad;
>> > >> +    } vcpu_to_vnode;
>> > >> +    union {
>> > >> +        XEN_GUEST_HANDLE(vmemrange_t) h;
>> > >> +        uint64_t pad;
>> > >> +    } vmemrange;
>> > >
>> > > Why do you need to use union? The other interface you introduce in this
>> > > patch doesn't use union.
>> >
>> > This is one is for making sure on 32 and 64 bits the structures are of
>> > the same size.
>> >
>>
>> I can see other similiar occurences of XEN_GUEST_HANDLE don't need
>> padding. Did I miss something?
>>
> I remember this coming up during review of an earlier version of Elena's
> series, and I think I also remember the union with padding solution
> being suggested, but I don't remember which round it was, and who
> suggested it... Elena, up for some digging in your inbox (or xen-devel
> archives)? :-P
>

Yes, sure, I do remember that!
David suggested constructing it that way, as some digging reveals:
http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg02065.html


> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Elena


* Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 12:49 ` Dario Faggioli
@ 2014-07-23  5:59   ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-23  5:59 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Wei Liu

On Tue, Jul 22, 2014 at 8:49 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On ven, 2014-07-18 at 01:49 -0400, Elena Ufimtseva wrote:
>> vNUMA introduction
>>
> Hey Elena!
>
> Thanks for this series, and in particular for this clear and complete
> cover letter.

Happy to hear this )

>
>> This series of patches introduces vNUMA topology awareness and
>> provides interfaces and data structures to enable vNUMA for
>> PV guests. There is a plan to extend this support for dom0 and
>> HVM domains.
>>
>> vNUMA topology support should be supported by PV guest kernel.
>> Corresponding patches should be applied.
>>
>> Introduction
>> -------------
>>
>> vNUMA topology is exposed to the PV guest to improve performance when running
>> workloads on NUMA machines. vNUMA enabled guests may be running on non-NUMA
>> machines and thus having virtual NUMA topology visible to guests.
>> XEN vNUMA implementation provides a way to run vNUMA-enabled guests on NUMA/UMA
>> and flexibly map vNUMA topology to physical NUMA topology.
>>
>> Mapping to physical NUMA topology may be done in manual and automatic way.
>> By default, every PV domain has one vNUMA node. It is populated by default
>> parameters and does not affect performance. To use automatic way of initializing
>> vNUMA topology, configuration file need only to have number of vNUMA nodes
>> defined. Not-defined vNUMA topology parameters will be initialized to default
>> ones.
>>
>> vNUMA topology is currently defined as a set of parameters such as:
>>     number of vNUMA nodes;
>>     distance table;
>>     vnodes memory sizes;
>>     vcpus to vnodes mapping;
>>     vnode to pnode map (for NUMA machines).
>>
> I'd include a brief explanation of what each parameter means and does.

Yep, will add this.
>
>>     XEN_DOMCTL_setvnumainfo is used by toolstack to populate domain
>> vNUMA topology with user defined configuration or the parameters by default.
>> vNUMA is defined for every PV domain and if no vNUMA configuration found,
>> one vNUMA node is initialized and all cpus are assigned to it. All other
>> parameters set to their default values.
>>
>>     XENMEM_gevnumainfo is used by the PV domain to get the information
>> from hypervisor about vNUMA topology. Guest sends its memory sizes allocated
>> for different vNUMA parameters and hypervisor fills it with topology.
>> Future work to use this in HVM guests in the toolstack is required and
>> in the hypervisor to allow HVM guests to use these hypercalls.
>>
>> libxl
>>
>> libxl allows us to define vNUMA topology in configuration file and verifies that
>> configuration is correct. libxl also verifies mapping of vnodes to pnodes and
>> uses it in case of NUMA-machine and if automatic placement was disabled. In case
>> of incorrect/insufficient configuration, one vNUMA node will be initialized
>> and populated with default values.
>>
> Well, about automatic placement, I think we don't need to disable vNUMA
> if it's enabled. In fact, automatic placement will try to place the
> domain on one node only, and yes, if it manages to do so, no point
> enabling vNUMA (unless the user asked for it, as you're saying). OTOH,
> if automatic placement puts the domain on 2 or more nodes (e.g., because
> the domain is 4G, and there is only 3G free on each node), then I think
> vNUMA should chime in, and provide the guest with an appropriate,
> internally built, NUMA topology.


Should we call it automatic vNUMA 'creation' then? I thought of it before
only in terms of placing vnodes onto some sensible set of pnodes. But it
looks like what you have just described is more about creating the vNUMA
topology itself.
Maybe we discussed it already :)

>
>> libxc
>>
>> libxc builds the vnodes memory addresses for guest and makes necessary
>> alignments to the addresses. It also takes into account guest e820 memory map
>> configuration. The domain memory is allocated and vnode to pnode mapping
>> is used to determine target node for particular vnode. If this mapping was not
>> defined, it is not a NUMA machine or automatic NUMA placement is enabled, the
>> default not node-specific allocation will be used.
>>
> Ditto. However, automatic placement does not do much at the libxc level
> right now, and I think that should continue to be the case.

Yes, I should have said that it only verifies/builds the mapping and
does the memory allocation for the domain.

>
> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Elena


* Re: Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction
  2014-07-22 22:53           ` Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction Dario Faggioli
@ 2014-07-23  6:00             ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-23  6:00 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Li Yechen, Keir Fraser, Ian Campbell, Stefano Stabellini,
	George Dunlap, Matt Wilson, Ian Jackson, xen-devel, Jan Beulich,
	Wei Liu

On Tue, Jul 22, 2014 at 6:53 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Tue, 2014-07-22 at 15:43 -0400, Konrad Rzeszutek Wilk wrote:
>> > I.e., no matter how I pin the vcpus, the guest sees the 4 vcpus as if
>> > they were all SMT siblings, within the same core, sharing all cache
>> > levels.
>>
>> My recollection was that the setting of these CPUID values is
>> tied in how the toolstack sees it - and since the toolstack
>> runs in the initial domain - that is where it picks this data up.
>>
>> This problem had been discussed by Andrew Cooper at some point
>> (Hackathon? Emails? IRC?) and moved under the 'fix cpuid creation/parsing'.
>>
>> I think that this issue should not affect Elena's patchset -
>> as the vNUMA is an innocent bystander that gets affected by this.
>>
> Agreed. Let's live with the WARN for now or, if that is annoying (e.g.
> for testing/benchmarking) with whatever (local!) hack one needs.
>
> Regards,
> Dario

Thank you guys, works for me for now.
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
>
>



-- 
Elena


* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
                     ` (2 preceding siblings ...)
  2014-07-22 15:14   ` Dario Faggioli
@ 2014-07-23 14:06   ` Jan Beulich
  2014-07-25  4:52     ` Elena Ufimtseva
  3 siblings, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2014-07-23 14:06 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	dario.faggioli, stefano.stabellini, ian.jackson, xen-devel

>>> On 18.07.14 at 07:50, <ufimtseva@gmail.com> wrote:
> +static int vnuma_alloc(struct vnuma_info **vnuma,
> +                       unsigned int nr_vnodes,
> +                       unsigned int nr_vcpus,
> +                       unsigned int dist_size)
> +{
> +    struct vnuma_info *v;
> +
> +    if ( vnuma && *vnuma )
> +        return -EINVAL;
> +
> +    v = *vnuma;
> +    /*
> +     * check if any of xmallocs exeeds PAGE_SIZE.
> +     * If yes, consider it as an error for now.
> +     */
> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
> +        dist_size > PAGE_SIZE / sizeof(dist_size) )

Three of the four checks are rather bogus - the types of the
variables just happen to match the types of the respective
array elements. Best to switch all of them to sizeof(*v->...).
Plus I'm not sure about the dist_size check - in its current shape
it's redundant with the nr_vnodes one (and really the function
parameter seems pointless, i.e. could be calculated here), and
it's questionable whether limiting that table against PAGE_SIZE
isn't too restrictive. Also indentation seems broken here.
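
I.e., something like the following (a sketch of just the sizeof() change;
sizeof() does not evaluate its operand, so this works even though v has not
been allocated yet):

    if ( nr_vnodes > PAGE_SIZE / sizeof(*v->vnode_to_pnode) ||
         nr_vcpus > PAGE_SIZE / sizeof(*v->vcpu_to_vnode)   ||
         nr_vnodes > PAGE_SIZE / sizeof(*v->vmemrange)      ||
         dist_size > PAGE_SIZE / sizeof(*v->vdistance) )
        return -EINVAL;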

> +static long vnuma_fallback(const struct domain *d,
> +                          struct vnuma_info **vnuma)
> +{
> +    struct vnuma_info *v;
> +    long ret;
> +
> +
> +    /* Will not destroy vNUMA here, destroy before calling this. */
> +    if ( vnuma && *vnuma )
> +        return -EINVAL;
> +
> +    v = *vnuma;
> +    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
> +    if ( ret )
> +        return ret;
> +
> +    v->vmemrange[0].start = 0;
> +    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;

Didn't we settle on using inclusive ranges to avoid problems at the
address space end? Or was that in the context of some other series
(likely by someone else)?

> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
> +                const struct domain *d,
> +                struct vnuma_info **dst)
> +{
> +    unsigned int dist_size, nr_vnodes = 0;
> +    long ret;
> +    struct vnuma_info *v = NULL;
> +
> +    ret = -EINVAL;
> +
> +    /* If vNUMA topology already set, just exit. */
> +    if ( !u_vnuma || *dst )
> +        return ret;
> +
> +    nr_vnodes = u_vnuma->nr_vnodes;
> +
> +    if ( nr_vnodes == 0 )
> +        return ret;
> +
> +    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
> +        return ret;
> +
> +    dist_size = nr_vnodes * nr_vnodes;
> +
> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
> +    if ( ret )
> +        return ret;
> +
> +    /* On failure, set only one vNUMA node and its success. */
> +    ret = 0;

Pointless - just use "return 0" below.

> +
> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
> +        d->max_vcpus) )

Indentation.

> +        goto vnuma_onenode;
> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
> +        nr_vnodes) )

Again.

> +        goto vnuma_onenode;
> +
> +    v->nr_vnodes = nr_vnodes;
> +    *dst = v;
> +
> +    return ret;
> +
> +vnuma_onenode:

Labels are to be indented by at least one space.

> +    case XEN_DOMCTL_setvnumainfo:
> +    {
> +        struct vnuma_info *v = NULL;
> +
> +        ret = -EFAULT;
> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
> +            return ret;
> +
> +        ret = -EINVAL;
> +
> +        ret = vnuma_init(&op->u.vnuma, d, &v);
> +        if ( ret < 0 || v == NULL )

So when v == NULL you return success (ret being 0)? That second
check is either pointless (could become an ASSERT()) or needs proper
handling.

> +            break;
> +
> +        /* overwrite vnuma for domain */
> +        if ( !d->vnuma )
> +            vnuma_destroy(d->vnuma);
> +
> +        domain_lock(d);
> +        d->vnuma = v;
> +        domain_unlock(d);

domain_lock()? What does this guard against? (We generally aim at
removing uses of domain_lock() rather than adding new ones.)

> +    case XENMEM_get_vnumainfo:
> +    {
> +        struct vnuma_topology_info topology;
> +        struct domain *d;
> +        unsigned int dom_vnodes = 0;

Pointless initializer.

> +
> +        /*
> +         * guest passes nr_vnodes and nr_vcpus thus
> +         * we know how much memory guest has allocated.
> +         */
> +        if ( copy_from_guest(&topology, arg, 1) ||
> +            guest_handle_is_null(topology.vmemrange.h) ||
> +            guest_handle_is_null(topology.vdistance.h) ||
> +            guest_handle_is_null(topology.vcpu_to_vnode.h) )
> +            return -EFAULT;
> +
> +        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
> +            return -ESRCH;
> +
> +        rc = -EOPNOTSUPP;
> +        if ( d->vnuma == NULL )
> +            goto vnumainfo_out;
> +
> +        if ( d->vnuma->nr_vnodes == 0 )
> +            goto vnumainfo_out;

Can this second condition validly (other than due to a race) be true if
the first one wasn't? (And of course there's synchronization missing
here, to avoid the race.)

> +
> +        dom_vnodes = d->vnuma->nr_vnodes;
> +
> +        /*
> +         * guest nr_cpus and nr_nodes may differ from domain vnuma config.
> +         * Check here guest nr_nodes and nr_cpus to make sure we dont overflow.
> +         */
> +        rc = -ENOBUFS;
> +        if ( topology.nr_vnodes < dom_vnodes ||
> +            topology.nr_vcpus < d->max_vcpus )
> +            goto vnumainfo_out;

You ought to be passing back the needed values in the structure fields.
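
One possible shape for that (sketch):

    rc = -ENOBUFS;
    if ( topology.nr_vnodes < dom_vnodes ||
         topology.nr_vcpus < d->max_vcpus )
    {
        /* Report the sizes the caller actually needs. */
        topology.nr_vnodes = dom_vnodes;
        topology.nr_vcpus = d->max_vcpus;
        if ( __copy_to_guest(arg, &topology, 1) )
            rc = -EFAULT;
        goto vnumainfo_out;
    }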

> +
> +        rc = -EFAULT;
> +
> +        if ( copy_to_guest(topology.vmemrange.h, d->vnuma->vmemrange,
> +                           dom_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( copy_to_guest(topology.vdistance.h, d->vnuma->vdistance,
> +                           dom_vnodes * dom_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( copy_to_guest(topology.vcpu_to_vnode.h, d->vnuma->vcpu_to_vnode,
> +                           d->max_vcpus) != 0 )
> +            goto vnumainfo_out;
> +
> +        topology.nr_vnodes = dom_vnodes;

And why not topology.nr_vcpus?

> +
> +        if ( copy_to_guest(arg, &topology, 1) != 0 )

__copy_to_guest() please.

Jan


* Re: [PATCH v6 03/10] vnuma hook to debug-keys u
  2014-07-18  5:50 ` [PATCH v6 03/10] vnuma hook to debug-keys u Elena Ufimtseva
@ 2014-07-23 14:10   ` Jan Beulich
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Beulich @ 2014-07-23 14:10 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	dario.faggioli, stefano.stabellini, ian.jackson, xen-devel

>>> On 18.07.14 at 07:50, <ufimtseva@gmail.com> wrote:
> @@ -389,6 +389,33 @@ static void dump_numa(unsigned char key)
>  
>  		for_each_online_node(i)
>  			printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +		if ( d->vnuma ) {
> +			printk("    Domain has %u vnodes, %u vcpus\n", d->vnuma->nr_vnodes, d->max_vcpus);
> +			for ( i = 0; i < d->vnuma->nr_vnodes; i++ ) {
> +				err = snprintf(keyhandler_scratch, 12, "%u", d->vnuma->vnode_to_pnode[i]);
> +				if ( err < 0 )
> +					printk("        vnode %u - pnode %s,", i, "any");

"any"? This is more like "unknown" or "???".

> +				else
> +					printk("        vnode %u - pnode %s,", i,
> +				d->vnuma->vnode_to_pnode[i] == NUMA_NO_NODE ? "any" : keyhandler_scratch);
> +				printk(" %"PRIu64" MB, ",
> +					(d->vnuma->vmemrange[i].end - d->vnuma->vmemrange[i].start) >> 20);
> +				printk("vcpu nums: ");

By strcpy()ing into keyhandler_scratch in the special case paths above
you could collapse all three printk()s into one.
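
A sketch of what that collapse might look like (same variables as in the
hunk above; untested):

    if ( snprintf(keyhandler_scratch, 12, "%u",
                  d->vnuma->vnode_to_pnode[i]) < 0 )
        strcpy(keyhandler_scratch, "???");
    else if ( d->vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
        strcpy(keyhandler_scratch, "any");

    printk("        vnode %u - pnode %s, %"PRIu64" MB, vcpu nums: ",
           i, keyhandler_scratch,
           (d->vnuma->vmemrange[i].end - d->vnuma->vmemrange[i].start) >> 20);
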

> +				for ( j = 0, n = 0; j < d->max_vcpus; j++ ) {
> +					if ( d->vnuma->vcpu_to_vnode[j] == i ) {
> +						if ( ((n + 1) % 8) == 0 )
> +							printk("%d\n", j);
> +						else if ( !(n % 8) && n != 0 )
> +							printk("%s%d ", "             ", j);
> +						else
> +							printk("%d ", j);
> +					n++;

Indentation.

Jan


* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-23 14:06   ` Jan Beulich
@ 2014-07-25  4:52     ` Elena Ufimtseva
  2014-07-25  7:33       ` Jan Beulich
  0 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-07-25  4:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Li Yechen, George Dunlap, Matt Wilson,
	Dario Faggioli, Stefano Stabellini, Ian Jackson, xen-devel

On Wed, Jul 23, 2014 at 10:06 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 18.07.14 at 07:50, <ufimtseva@gmail.com> wrote:
>> +static int vnuma_alloc(struct vnuma_info **vnuma,
>> +                       unsigned int nr_vnodes,
>> +                       unsigned int nr_vcpus,
>> +                       unsigned int dist_size)
>> +{
>> +    struct vnuma_info *v;
>> +
>> +    if ( vnuma && *vnuma )
>> +        return -EINVAL;
>> +
>> +    v = *vnuma;
>> +    /*
>> +     * check if any of xmallocs exeeds PAGE_SIZE.
>> +     * If yes, consider it as an error for now.
>> +     */
>> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
>> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
>> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
>> +        dist_size > PAGE_SIZE / sizeof(dist_size) )
>
Hi Jan,
Thank you for your review.

> Three of the four checks are rather bogus - the types of the
> variables just happen to match the types of the respective
> array elements. Best to switch all of them to sizeof(*v->...).
> Plus I'm not sure about the dist_size check - in its current shape
> it's redundant with the nr_vnodes one (and really the function
> parameter seems pointless, i.e. could be calculated here), and
> it's questionable whether limiting that table against PAGE_SIZE
> isn't too restrictive. Also indentation seems broken here.


I agree on the distance table memory allocation limit.
With the current interface, the nr_vnodes size check effectively limits the
vdistance table dimension to 256, which is then also the maximum number of
vNUMA nodes. Thus the vdistance table would need an allocation of 2 pages
of 4K size.
Would that be viewed as a potential candidate for the list of affected
hypercalls in XSA-77?

>
>> +static long vnuma_fallback(const struct domain *d,
>> +                          struct vnuma_info **vnuma)
>> +{
>> +    struct vnuma_info *v;
>> +    long ret;
>> +
>> +
>> +    /* Will not destroy vNUMA here, destroy before calling this. */
>> +    if ( vnuma && *vnuma )
>> +        return -EINVAL;
>> +
>> +    v = *vnuma;
>> +    ret = vnuma_alloc(&v, 1, d->max_vcpus, 1);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    v->vmemrange[0].start = 0;
>> +    v->vmemrange[0].end = d->max_pages << PAGE_SHIFT;
>
> Didn't we settle on using inclusive ranges to avoid problems at the
> address space end? Or was that in the context of some other series
> (likely by someone else)?

Yes, I think it was from Arianna.
http://lists.xen.org/archives/html/xen-devel/2014-04/msg02641.html

>
>> +long vnuma_init(const struct xen_domctl_vnuma *u_vnuma,
>> +                const struct domain *d,
>> +                struct vnuma_info **dst)
>> +{
>> +    unsigned int dist_size, nr_vnodes = 0;
>> +    long ret;
>> +    struct vnuma_info *v = NULL;
>> +
>> +    ret = -EINVAL;
>> +
>> +    /* If vNUMA topology already set, just exit. */
>> +    if ( !u_vnuma || *dst )
>> +        return ret;
>> +
>> +    nr_vnodes = u_vnuma->nr_vnodes;
>> +
>> +    if ( nr_vnodes == 0 )
>> +        return ret;
>> +
>> +    if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
>> +        return ret;
>> +
>> +    dist_size = nr_vnodes * nr_vnodes;
>> +
>> +    ret = vnuma_alloc(&v, nr_vnodes, d->max_vcpus, dist_size);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    /* On failure, set only one vNUMA node and its success. */
>> +    ret = 0;
>
> Pointless - just use "return 0" below.
>
>> +
>> +    if ( copy_from_guest(v->vdistance, u_vnuma->vdistance, dist_size) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vmemrange, u_vnuma->vmemrange, nr_vnodes) )
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vcpu_to_vnode, u_vnuma->vcpu_to_vnode,
>> +        d->max_vcpus) )
>
> Indentation.
>
>> +        goto vnuma_onenode;
>> +    if ( copy_from_guest(v->vnode_to_pnode, u_vnuma->vnode_to_pnode,
>> +        nr_vnodes) )
>
> Again.
>
>> +        goto vnuma_onenode;
>> +
>> +    v->nr_vnodes = nr_vnodes;
>> +    *dst = v;
>> +
>> +    return ret;
>> +
>> +vnuma_onenode:
>
> Labels are to be indented by at least one space.
>
>> +    case XEN_DOMCTL_setvnumainfo:
>> +    {
>> +        struct vnuma_info *v = NULL;
>> +
>> +        ret = -EFAULT;
>> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
>> +            guest_handle_is_null(op->u.vnuma.vmemrange)      ||
>> +            guest_handle_is_null(op->u.vnuma.vcpu_to_vnode)  ||
>> +            guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
>> +            return ret;
>> +
>> +        ret = -EINVAL;
>> +
>> +        ret = vnuma_init(&op->u.vnuma, d, &v);
>> +        if ( ret < 0 || v == NULL )
>
> So when v == NULL you return success (ret being 0)? That second
> check is either pointless (could become an ASSERT()) or needs proper
> handling.
>
>> +            break;
>> +
>> +        /* overwrite vnuma for domain */
>> +        if ( !d->vnuma )
>> +            vnuma_destroy(d->vnuma);
>> +
>> +        domain_lock(d);
>> +        d->vnuma = v;
>> +        domain_unlock(d);
>
> domain_lock()? What does this guard against? (We generally aim at
> removing uses of domain_lock() rather than adding new ones.)
>
>> +    case XENMEM_get_vnumainfo:
>> +    {
>> +        struct vnuma_topology_info topology;
>> +        struct domain *d;
>> +        unsigned int dom_vnodes = 0;
>
> Pointless initializer.
>
>> +
>> +        /*
>> +         * guest passes nr_vnodes and nr_vcpus thus
>> +         * we know how much memory guest has allocated.
>> +         */
>> +        if ( copy_from_guest(&topology, arg, 1) ||
>> +            guest_handle_is_null(topology.vmemrange.h) ||
>> +            guest_handle_is_null(topology.vdistance.h) ||
>> +            guest_handle_is_null(topology.vcpu_to_vnode.h) )
>> +            return -EFAULT;
>> +
>> +        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
>> +            return -ESRCH;
>> +
>> +        rc = -EOPNOTSUPP;
>> +        if ( d->vnuma == NULL )
>> +            goto vnumainfo_out;
>> +
>> +        if ( d->vnuma->nr_vnodes == 0 )
>> +            goto vnumainfo_out;
>
> Can this second condition validly (other than due to a race) be true if
> the first one wasn't? (And of course there's synchronization missing
> here, to avoid the race.)

My idea of using the pair domain_lock and rcu_lock_domain_by_any_id was to
avoid that race.
I used domain_lock in the domctl hypercall when the pointer to a domain's
vnuma is being set.
XENMEM_get_vnumainfo reads the values and holds the reader lock of that domain.
As setting vnuma happens once when the domain boots, domain_lock seemed
to be ok here.
Would a spinlock be more appropriate here?

>
>> +
>> +        dom_vnodes = d->vnuma->nr_vnodes;
>> +
>> +        /*
>> +         * guest nr_cpus and nr_nodes may differ from domain vnuma config.
>> +         * Check here guest nr_nodes and nr_cpus to make sure we dont overflow.
>> +         */
>> +        rc = -ENOBUFS;
>> +        if ( topology.nr_vnodes < dom_vnodes ||
>> +            topology.nr_vcpus < d->max_vcpus )
>> +            goto vnumainfo_out;
>
> You ought to be passing back the needed values in the structure fields.

Understood.
>
>> +
>> +        rc = -EFAULT;
>> +
>> +        if ( copy_to_guest(topology.vmemrange.h, d->vnuma->vmemrange,
>> +                           dom_vnodes) != 0 )
>> +            goto vnumainfo_out;
>> +
>> +        if ( copy_to_guest(topology.vdistance.h, d->vnuma->vdistance,
>> +                           dom_vnodes * dom_vnodes) != 0 )
>> +            goto vnumainfo_out;
>> +
>> +        if ( copy_to_guest(topology.vcpu_to_vnode.h, d->vnuma->vcpu_to_vnode,
>> +                           d->max_vcpus) != 0 )
>> +            goto vnumainfo_out;
>> +
>> +        topology.nr_vnodes = dom_vnodes;
>
> And why not topology.nr_vcpus?

Yes, I missed that one.

>
>> +
>> +        if ( copy_to_guest(arg, &topology, 1) != 0 )
>
> __copy_to_guest() please.
>
> Jan



-- 
Elena


* Re: [PATCH v6 01/10] xen: vnuma topology and subop hypercalls
  2014-07-25  4:52     ` Elena Ufimtseva
@ 2014-07-25  7:33       ` Jan Beulich
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Beulich @ 2014-07-25  7:33 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Keir Fraser, Ian Campbell, Li Yechen, George Dunlap, Matt Wilson,
	Dario Faggioli, Stefano Stabellini, Ian Jackson, xen-devel

>>> On 25.07.14 at 06:52, <ufimtseva@gmail.com> wrote:
> On Wed, Jul 23, 2014 at 10:06 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 18.07.14 at 07:50, <ufimtseva@gmail.com> wrote:
>>> +static int vnuma_alloc(struct vnuma_info **vnuma,
>>> +                       unsigned int nr_vnodes,
>>> +                       unsigned int nr_vcpus,
>>> +                       unsigned int dist_size)
>>> +{
>>> +    struct vnuma_info *v;
>>> +
>>> +    if ( vnuma && *vnuma )
>>> +        return -EINVAL;
>>> +
>>> +    v = *vnuma;
>>> +    /*
>>> +     * check if any of xmallocs exeeds PAGE_SIZE.
>>> +     * If yes, consider it as an error for now.
>>> +     */
>>> +    if ( nr_vnodes > PAGE_SIZE / sizeof(nr_vnodes)       ||
>>> +        nr_vcpus > PAGE_SIZE / sizeof(nr_vcpus)          ||
>>> +        nr_vnodes > PAGE_SIZE / sizeof(struct vmemrange) ||
>>> +        dist_size > PAGE_SIZE / sizeof(dist_size) )
>>
>> Three of the four checks are rather bogus - the types of the
>> variables just happen to match the types of the respective
>> array elements. Best to switch all of them to sizeof(*v->...).
>> Plus I'm not sure about the dist_size check - in its current shape
>> it's redundant with the nr_vnodes one (and really the function
>> parameter seems pointless, i.e. could be calculated here), and
>> it's questionable whether limiting that table against PAGE_SIZE
>> isn't too restrictive. Also indentation seems broken here.
> 
> I agree on distance table memory allocation limit.
> The max vdistance (in current interface) table dimension will be
> effectively 256 after nr_vnodes size check,
> so the max vNUMA nodes. Thus vdistance table will need to allocate 2
> pages of 4K size.
> Will be that viewed as a potential candidate to a list of affected
> hypercalls in XSA-77?

That list isn't permitted to be extended, so the multi-page allocation
needs to be avoided. And a 2-page allocation wouldn't mean a
security problem (after all the allocation still has a deterministic upper
bound) - it's a functionality one. Just allocate separate pages and
vmap() them.

>>> +
>>> +        /*
>>> +         * guest passes nr_vnodes and nr_vcpus thus
>>> +         * we know how much memory guest has allocated.
>>> +         */
>>> +        if ( copy_from_guest(&topology, arg, 1) ||
>>> +            guest_handle_is_null(topology.vmemrange.h) ||
>>> +            guest_handle_is_null(topology.vdistance.h) ||
>>> +            guest_handle_is_null(topology.vcpu_to_vnode.h) )
>>> +            return -EFAULT;
>>> +
>>> +        if ( (d = rcu_lock_domain_by_any_id(topology.domid)) == NULL )
>>> +            return -ESRCH;
>>> +
>>> +        rc = -EOPNOTSUPP;
>>> +        if ( d->vnuma == NULL )
>>> +            goto vnumainfo_out;
>>> +
>>> +        if ( d->vnuma->nr_vnodes == 0 )
>>> +            goto vnumainfo_out;
>>
>> Can this second condition validly (other than due to a race) be true if
>> the first one wasn't? (And of course there's synchronization missing
>> here, to avoid the race.)
> 
> My idea of using pair domain_lock and rcu_lock_domain_by_any_id was to
> avoid that race.

rcu_lock_domain_by_any_id() only guarantees the domain to not
go away under your feet. It means nothing towards a racing
update of d->vnuma.

> I used domain_lock in domctl hypercall when the pointer to vnuma of a
> domain is being set.

Right, but only protecting the writer side isn't providing any
synchronization.

> XENMEM_get_vnumainfo reads the values and hold the reader lock of that 
> domain.
> As setting vnuma happens once on booting domain,  domain_lock seemed
> to be ok here.
> Would be a spinlock more appropriate here?

Or an rw lock (if no lockless mechanism can be found).
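
For concreteness, a sketch of the rwlock variant (assuming a lock dedicated
to d->vnuma; untested):

    /* in struct domain, next to the vnuma pointer (illustrative): */
    rwlock_t vnuma_rwlock;

    /* writer side, XEN_DOMCTL_setvnumainfo: */
    write_lock(&d->vnuma_rwlock);
    vnuma_destroy(d->vnuma);
    d->vnuma = v;
    write_unlock(&d->vnuma_rwlock);

    /* reader side, XENMEM_get_vnumainfo: */
    read_lock(&d->vnuma_rwlock);
    /* ... read nr_vnodes and copy the arrays out under the lock,
     *     or snapshot them into temporary buffers first ... */
    read_unlock(&d->vnuma_rwlock);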

Jan


* Re: [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA
  2014-07-18  5:50 ` [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA Elena Ufimtseva
  2014-07-18 10:33   ` Wei Liu
@ 2014-07-29 10:33   ` Ian Campbell
  1 sibling, 0 replies; 63+ messages in thread
From: Ian Campbell @ 2014-07-29 10:33 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, stefano.stabellini, george.dunlap, msw, dario.faggioli,
	lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> With the introduction of the XEN_DOMCTL_setvnumainfo
> in patch titled: "xen: vnuma topology and subop hypercalls"
> we put in the plumbing here to use from the toolstack. The user
> is allowed to call this multiple times if they wish so.
> It will error out if the nr_vnodes or nr_vcpus is zero.
> 
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  tools/libxc/xc_domain.c |   63 +++++++++++++++++++++++++++++++++++++++++++++++
>  tools/libxc/xenctrl.h   |    9 +++++++
>  2 files changed, 72 insertions(+)
> 
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index 0230c6c..a5625b5 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -2123,6 +2123,69 @@ int xc_domain_set_max_evtchn(xc_interface *xch, uint32_t domid,
>      return do_domctl(xch, &domctl);
>  }
>  
> +/* Plumbing Xen with vNUMA topology */
> +int xc_domain_setvnuma(xc_interface *xch,
> +                        uint32_t domid,
> +                        uint16_t nr_vnodes,
> +                        uint16_t nr_vcpus,
> +                        vmemrange_t *vmemrange,
> +                        unsigned int *vdistance,
> +                        unsigned int *vcpu_to_vnode,
> +                        unsigned int *vnode_to_pnode)
> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
> +                                    nr_vnodes * nr_vnodes,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) *
> +                                    nr_vnodes,
> +                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> +    if ( nr_vnodes == 0 ) {

{'s on the next line please (throughout)

> +        errno = EINVAL;
> +        return -1;
> +    }
> +
> +    if ( !vdistance || !vcpu_to_vnode || !vmemrange || !vnode_to_pnode ) {
> +        PERROR("%s: Cant set vnuma without initializing topology", __func__);
> +        errno = EINVAL;
> +        return -1;
> +    }
> +
> +    if ( xc_hypercall_bounce_pre(xch, vmemrange)      ||
> +         xc_hypercall_bounce_pre(xch, vdistance)      ||
> +         xc_hypercall_bounce_pre(xch, vcpu_to_vnode)  ||
> +         xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) {
> +        PERROR("%s: Could not bounce buffers!", __func__);
> +        errno = EFAULT;

You will leak whichever of these succeeded before the failure. You
can set rc and goto an out label on the exit path which already does the
cleanup by calling bounce_post.

> +        return -1;
> +    }
> +
> +    set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange);
> +    set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance);
> +    set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode);
> +    set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode);
> +
> +    domctl.cmd = XEN_DOMCTL_setvnumainfo;
> +    domctl.domain = (domid_t)domid;
> +    domctl.u.vnuma.nr_vnodes = nr_vnodes;
> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    xc_hypercall_bounce_post(xch, vmemrange);
> +    xc_hypercall_bounce_post(xch, vdistance);
> +    xc_hypercall_bounce_post(xch, vcpu_to_vnode);
> +    xc_hypercall_bounce_post(xch, vnode_to_pnode);
> +
> +    if ( rc )
> +        errno = EFAULT;

Why override the errno from do_domctl? Surely there are other failure
modes than EFAULT which can occur?

Ian.
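
For reference, the usual shape of that exit path (same calls as in the hunk
above; sketch only):

    int rc = -1;

    if ( xc_hypercall_bounce_pre(xch, vmemrange)      ||
         xc_hypercall_bounce_pre(xch, vdistance)      ||
         xc_hypercall_bounce_pre(xch, vcpu_to_vnode)  ||
         xc_hypercall_bounce_pre(xch, vnode_to_pnode) )
    {
        PERROR("%s: Could not bounce buffers!", __func__);
        goto out;
    }

    /* ... fill in domctl and issue it ... */
    rc = do_domctl(xch, &domctl);

 out:
    xc_hypercall_bounce_post(xch, vmemrange);
    xc_hypercall_bounce_post(xch, vdistance);
    xc_hypercall_bounce_post(xch, vcpu_to_vnode);
    xc_hypercall_bounce_post(xch, vnode_to_pnode);

    /* leave errno as set by do_domctl() */
    return rc;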


* Re: [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc
  2014-07-18  5:50 ` [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc Elena Ufimtseva
  2014-07-18 10:53   ` Wei Liu
@ 2014-07-29 10:38   ` Ian Campbell
  2014-07-29 10:42   ` Ian Campbell
  2 siblings, 0 replies; 63+ messages in thread
From: Ian Campbell @ 2014-07-29 10:38 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, stefano.stabellini, george.dunlap, msw, dario.faggioli,
	lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index de25f42..5876822 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -318,7 +318,11 @@ libxl_domain_build_info = Struct("domain_build_info",[
>      ("disable_migrate", libxl_defbool),
>      ("cpuid",           libxl_cpuid_policy_list),
>      ("blkdev_start",    string),
> -    
> +    ("vnuma_mem",     Array(uint64, "nr_nodes")),
> +    ("vnuma_vcpumap",     Array(uint32, "nr_nodemap")),
> +    ("vdistance",        Array(uint32, "nr_dist")),
> +    ("vnuma_vnodemap",  Array(uint32, "nr_node_to_pnode")),

Alignment of the Array please.

Also tools/libxl/idl.txt says:

  idl.Array.len_var contains an idl.Field which is added to the parent
  idl.Aggregate and will contain the length of the array. The field
  MUST be named num_ARRAYNAME.

So these need to be num_vnuma_mem etc.
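
For illustration, renamed to follow that convention the declarations might
look like this (the field names are a sketch only, not the final interface):

    ("vnuma_mem",      Array(uint64, "num_vnuma_mem")),
    ("vnuma_vcpumap",  Array(uint32, "num_vnuma_vcpumap")),
    ("vdistance",      Array(uint32, "num_vdistance")),
    ("vnuma_vnodemap", Array(uint32, "num_vnuma_vnodemap")),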

You also need to add a LIBXL_HAVE_* define to libxl.h to flag the
availability of this interface to the consumers. (Only add the define,
neither libxl nor xl should consume it, it's for external users only)
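
For example, something along these lines, following the style of the
existing LIBXL_HAVE_* blocks in libxl.h (the macro name here is only
illustrative):

    /*
     * LIBXL_HAVE_BUILDINFO_VNUMA
     *
     * If this is defined, libxl_domain_build_info contains the vnuma_mem,
     * vnuma_vcpumap, vdistance and vnuma_vnodemap arrays.
     */
    #define LIBXL_HAVE_BUILDINFO_VNUMA 1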

This series seems to add the libxl interface changes and the xl
consumer, but not the actual libxl implementation. I think you probably
need to reorder the series a bit. Normally we would want to see the
libxl changes (headers and implementation) first followed by the xl
change to use the new interface.

> diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
> new file mode 100644
> index 0000000..4ff4c57
> --- /dev/null
> +++ b/tools/libxl/libxl_vnuma.h
> @@ -0,0 +1,8 @@
> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#define VNUMA_NO_NODE ~((unsigned int)0)
> +
> +/* Max vNUMA node size from Linux. */
> +#define MIN_VNODE_SIZE  32U

Max or min?

> +
> +#define MAX_VNUMA_NODES (unsigned int)1 << 10

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 06/10] libxc: move code to arch_boot_alloc func
  2014-07-18  5:50 ` [PATCH v6 06/10] libxc: move code to arch_boot_alloc func Elena Ufimtseva
@ 2014-07-29 10:38   ` Ian Campbell
  0 siblings, 0 replies; 63+ messages in thread
From: Ian Campbell @ 2014-07-29 10:38 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, stefano.stabellini, george.dunlap, msw, dario.faggioli,
	lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> No functional changes, just moving code.
> Prepare for next patch "libxc: allocate
> domain memory for vnuma enabled domains"
> 
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc
  2014-07-18  5:50 ` [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc Elena Ufimtseva
  2014-07-18 10:53   ` Wei Liu
  2014-07-29 10:38   ` Ian Campbell
@ 2014-07-29 10:42   ` Ian Campbell
  2014-08-06  4:46     ` Elena Ufimtseva
  2 siblings, 1 reply; 63+ messages in thread
From: Ian Campbell @ 2014-07-29 10:42 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, stefano.stabellini, george.dunlap, msw, dario.faggioli,
	lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
> new file mode 100644
> index 0000000..4ff4c57
> --- /dev/null
> +++ b/tools/libxl/libxl_vnuma.h
> @@ -0,0 +1,8 @@

Needs the normal guards against repeated inclusion.

Is this intended to be an internal header? If it is to be used by
applications then the things which it defines should be correctly
namespaced.
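
A sketch of a guarded, namespaced version of this header (all names are
illustrative; it also parenthesises the shift, which the current
MAX_VNUMA_NODES definition does not):

    #ifndef LIBXL_VNUMA_H
    #define LIBXL_VNUMA_H

    #include "libxl_osdeps.h" /* must come before any other headers */

    /* "no node" sentinel value. */
    #define LIBXL_VNUMA_NO_NODE    (~(unsigned int)0)

    /* Same limits as before, just namespaced. */
    #define LIBXL_MIN_VNODE_SIZE   32U
    #define LIBXL_MAX_VNUMA_NODES  (1U << 10)

    #endif /* LIBXL_VNUMA_H */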

> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#define VNUMA_NO_NODE ~((unsigned int)0)
> +
> +/* Max vNUMA node size from Linux. */
> +#define MIN_VNODE_SIZE  32U
> +
> +#define MAX_VNUMA_NODES (unsigned int)1 << 10

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled
  2014-07-18  5:50 ` [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled Elena Ufimtseva
@ 2014-07-29 10:43   ` Ian Campbell
  2014-08-06  4:48     ` Elena Ufimtseva
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Campbell @ 2014-07-29 10:43 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, stefano.stabellini, george.dunlap, msw, dario.faggioli,
	lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
> index 40d3408..23267ed 100644
> --- a/tools/libxc/xc_dom_x86.c
> +++ b/tools/libxc/xc_dom_x86.c
> @@ -756,26 +756,6 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
>      return rc;
>  }
>  
> -int arch_boot_alloc(struct xc_dom_image *dom)
> -{
> -        int rc = 0;
> -        xen_pfn_t allocsz, i;
> -
> -        /* allocate guest memory */
> -        for ( i = rc = allocsz = 0;
> -              (i < dom->total_pages) && !rc;
> -              i += allocsz )
> -        {
> -            allocsz = dom->total_pages - i;
> -            if ( allocsz > 1024*1024 )
> -                allocsz = 1024*1024;
> -            rc = xc_domain_populate_physmap_exact(
> -                dom->xch, dom->guest_domid, allocsz,
> -                0, 0, &dom->p2m_host[i]);
> -        }
> -        return rc;
> -}

You only just moved this here in the last patch! Please move it to the
right place from the beginning.

> -
>  int arch_setup_meminit(struct xc_dom_image *dom)
>  {
>      int rc;
> @@ -832,6 +812,13 @@ int arch_setup_meminit(struct xc_dom_image *dom)
>          for ( pfn = 0; pfn < dom->total_pages; pfn++ )
>              dom->p2m_host[pfn] = pfn;
>  
> +        /* allocate guest memory */
> +        if ( dom->nr_nodes == 0 ) {
> +            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
> +                         "%s: Cannot allocate domain memory for 0 vnodes\n",
> +                         __FUNCTION__);

Should this not indicate a system where vnuma is not enabled?
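
One possible reading of this, as a hedged sketch only (exact wording and
behaviour are up to the author): report that no vNUMA topology was set up
rather than talking about 0 vnodes:

        if ( dom->nr_nodes == 0 )
        {
            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
                         "%s: vNUMA topology not set for this domain\n",
                         __FUNCTION__);
            return -EINVAL;
        }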

> +            return -EINVAL;
> +        }
>          rc = arch_boot_alloc(dom);
>          if ( rc )
>              return rc;

> diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h
> index e593364..21e4a20 100644
> --- a/tools/libxc/xg_private.h
> +++ b/tools/libxc/xg_private.h
> @@ -123,6 +123,7 @@ typedef uint64_t l4_pgentry_64_t;
>  #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
>  #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)
>  
> +#define VNUMA_NO_NODE ~((unsigned int)0)

This was defined in a previous patch too, in a libxl header, I think.

If this is not to be exposed to libxl users then you don't need the
libxl copy at all -- you should add this to a public libxc header, with
a suitable namespace prefix, and libxl can consume it from there.
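
A sketch of the single shared definition being suggested; the header
location and exact name are assumptions, not something already in the tree:

    /* In a public libxc header, e.g. tools/libxc/xenctrl.h: */
    #define XC_VNUMA_NO_NODE (~0U)   /* "no node" sentinel */

libxl (and the xg_private.h users) would then pick this up from xenctrl.h
instead of each keeping a private VNUMA_NO_NODE copy.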

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-07-18  5:50 ` [PATCH v6 10/10] libxl: set vnuma for domain Elena Ufimtseva
  2014-07-18 10:58   ` Wei Liu
@ 2014-07-29 10:45   ` Ian Campbell
  2014-08-12  3:52     ` Elena Ufimtseva
  1 sibling, 1 reply; 63+ messages in thread
From: Ian Campbell @ 2014-07-29 10:45 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, stefano.stabellini, george.dunlap, msw, dario.faggioli,
	lccycc123, ian.jackson, xen-devel, JBeulich

On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> Call xc_domain_setvnuma to set vnuma topology for domain.
> Prepares xc_dom_image for domain bootmem memory allocation
> 
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  tools/libxl/libxl.c |   22 ++++++++++++++++++++++
>  tools/libxl/libxl.h |   19 +++++++++++++++++++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 39f1c28..e9f2607 100644
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -4807,6 +4807,28 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid,
>      return 0;
>  }
>  
> +int libxl_domain_setvnuma(libxl_ctx *ctx,
> +                            uint32_t domid,

Can this be done on an existing domain? I'd have expected this to be an
internal function which is called from the inside of the domain creation
machinery.

Does anything use this? If yes then since this is the last patch you
must have introduced a bisection hazard.

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc
  2014-07-29 10:42   ` Ian Campbell
@ 2014-08-06  4:46     ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-08-06  4:46 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir Fraser, Stefano Stabellini, George Dunlap, Matt Wilson,
	Dario Faggioli, Li Yechen, Ian Jackson, xen-devel, Jan Beulich

On Tue, Jul 29, 2014 at 6:42 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
>> diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
>> new file mode 100644
>> index 0000000..4ff4c57
>> --- /dev/null
>> +++ b/tools/libxl/libxl_vnuma.h
>> @@ -0,0 +1,8 @@
>
> Needs the normal guards against repeated inclusion.
>
> Is this intended to be an internal header? If it is to be used by
> applications then the things which it defines should be correctly
> namespaced.
>

Thanks Ian for your review.
Sorry for the late response; I had to work on some day-job-related things.
I will address your comments in the next series of patches.

Elena

>> +#include "libxl_osdeps.h" /* must come before any other headers */
>> +
>> +#define VNUMA_NO_NODE ~((unsigned int)0)
>> +
>> +/* Max vNUMA node size from Linux. */
>> +#define MIN_VNODE_SIZE  32U
>> +
>> +#define MAX_VNUMA_NODES (unsigned int)1 << 10
>
> Ian.
>



-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled
  2014-07-29 10:43   ` Ian Campbell
@ 2014-08-06  4:48     ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-08-06  4:48 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir Fraser, Stefano Stabellini, George Dunlap, Matt Wilson,
	Dario Faggioli, Li Yechen, Ian Jackson, xen-devel, Jan Beulich

On Tue, Jul 29, 2014 at 6:43 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
>> diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
>> index 40d3408..23267ed 100644
>> --- a/tools/libxc/xc_dom_x86.c
>> +++ b/tools/libxc/xc_dom_x86.c
>> @@ -756,26 +756,6 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
>>      return rc;
>>  }
>>
>> -int arch_boot_alloc(struct xc_dom_image *dom)
>> -{
>> -        int rc = 0;
>> -        xen_pfn_t allocsz, i;
>> -
>> -        /* allocate guest memory */
>> -        for ( i = rc = allocsz = 0;
>> -              (i < dom->total_pages) && !rc;
>> -              i += allocsz )
>> -        {
>> -            allocsz = dom->total_pages - i;
>> -            if ( allocsz > 1024*1024 )
>> -                allocsz = 1024*1024;
>> -            rc = xc_domain_populate_physmap_exact(
>> -                dom->xch, dom->guest_domid, allocsz,
>> -                0, 0, &dom->p2m_host[i]);
>> -        }
>> -        return rc;
>> -}
>
> You only just moved this here in the last patch! Please move it to the
> right place from the beginning.
>
>> -
>>  int arch_setup_meminit(struct xc_dom_image *dom)
>>  {
>>      int rc;
>> @@ -832,6 +812,13 @@ int arch_setup_meminit(struct xc_dom_image *dom)
>>          for ( pfn = 0; pfn < dom->total_pages; pfn++ )
>>              dom->p2m_host[pfn] = pfn;
>>
>> +        /* allocate guest memory */
>> +        if ( dom->nr_nodes == 0 ) {
>> +            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
>> +                         "%s: Cannot allocate domain memory for 0 vnodes\n",
>> +                         __FUNCTION__);
>
> Should this not indicate a system where vnuma is not enabled?
>
>> +            return -EINVAL;
>> +        }
>>          rc = arch_boot_alloc(dom);
>>          if ( rc )
>>              return rc;
>
>> diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h
>> index e593364..21e4a20 100644
>> --- a/tools/libxc/xg_private.h
>> +++ b/tools/libxc/xg_private.h
>> @@ -123,6 +123,7 @@ typedef uint64_t l4_pgentry_64_t;
>>  #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
>>  #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)
>>
>> +#define VNUMA_NO_NODE ~((unsigned int)0)
>
> This was defined in a previous patch too, in a libxl header, I think.
>
> If this is not to be exposed to libxl users then you don't need the
> libxl copy at all -- you should add this to a public libxc header, with
> a suitable namespace prefix, and libxl can consume it from there.
>
> Ian.
>

Thanks Ian,
I will address these as well.

-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-07-29 10:45   ` Ian Campbell
@ 2014-08-12  3:52     ` Elena Ufimtseva
  2014-08-12  9:42       ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Elena Ufimtseva @ 2014-08-12  3:52 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir Fraser, Stefano Stabellini, George Dunlap, Matt Wilson,
	Dario Faggioli, Li Yechen, Ian Jackson, xen-devel, Jan Beulich

On Tue, Jul 29, 2014 at 6:45 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
>> Call xc_domain_setvnuma to set vnuma topology for domain.
>> Prepares xc_dom_image for domain bootmem memory allocation
>>
>> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
>> ---
>>  tools/libxl/libxl.c |   22 ++++++++++++++++++++++
>>  tools/libxl/libxl.h |   19 +++++++++++++++++++
>>  2 files changed, 41 insertions(+)
>>
>> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
>> index 39f1c28..e9f2607 100644
>> --- a/tools/libxl/libxl.c
>> +++ b/tools/libxl/libxl.c
>> @@ -4807,6 +4807,28 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid,
>>      return 0;
>>  }
>>
>> +int libxl_domain_setvnuma(libxl_ctx *ctx,
>> +                            uint32_t domid,
>
> Can this be done on an existing domain? I'd have expected this to be an
> internal function which is called from the inside of the domain creation
> machinery.
>
> Does anything use this? If yes then since this is the last patch you
> must have introduced a bisection hazard.
>
> Ian.
>

Ian, Wei

Preparing the series, I figured that this particular function
libxl_domain_setvnuma does not have any users. Instead, xc one is
called directly.
I want to omit this patch. Do I need to have  #define
LIBXL_HAVE_DOMAIN_SETVNUMA 1 in this case?

Thank you.

-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-08-12  3:52     ` Elena Ufimtseva
@ 2014-08-12  9:42       ` Wei Liu
  2014-08-12 17:10         ` Dario Faggioli
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-08-12  9:42 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich, wei.liu2

On Mon, Aug 11, 2014 at 11:52:55PM -0400, Elena Ufimtseva wrote:
> On Tue, Jul 29, 2014 at 6:45 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Fri, 2014-07-18 at 01:50 -0400, Elena Ufimtseva wrote:
> >> Call xc_domain_setvnuma to set vnuma topology for domain.
> >> Prepares xc_dom_image for domain bootmem memory allocation
> >>
> >> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> >> ---
> >>  tools/libxl/libxl.c |   22 ++++++++++++++++++++++
> >>  tools/libxl/libxl.h |   19 +++++++++++++++++++
> >>  2 files changed, 41 insertions(+)
> >>
> >> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> >> index 39f1c28..e9f2607 100644
> >> --- a/tools/libxl/libxl.c
> >> +++ b/tools/libxl/libxl.c
> >> @@ -4807,6 +4807,28 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid,
> >>      return 0;
> >>  }
> >>
> >> +int libxl_domain_setvnuma(libxl_ctx *ctx,
> >> +                            uint32_t domid,
> >
> > Can this be done on an existing domain? I'd have expected this to be an
> > internal function which is called from the inside of the domain creation
> > machinery.
> >
> > Does anything use this? If yes then since this is the last patch you
> > must have introduced a bisection hazard.
> >
> > Ian.
> >
> 
> Ian, Wei
> 
> Preparing the series, I figured that this particular function
> libxl_domain_setvnuma does not have any users. Instead, xc one is
> called directly.
> I want to omit this patch. Do I need to have  #define
> LIBXL_HAVE_DOMAIN_SETVNUMA 1 in this case?
> 

If your macro is specifically for this function then you need to remove
macro as well. In this case I think the answer is "no you don't need
that in libxl.h", unless I've mistaken the purpose of your macro.

Wei.

> Thank you.
> 
> -- 
> Elena
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-08-12  9:42       ` Wei Liu
@ 2014-08-12 17:10         ` Dario Faggioli
  2014-08-12 17:13           ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Dario Faggioli @ 2014-08-12 17:10 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Elena Ufimtseva



On mar, 2014-08-12 at 10:42 +0100, Wei Liu wrote:
> On Mon, Aug 11, 2014 at 11:52:55PM -0400, Elena Ufimtseva wrote:

> > Preparing the series, I figured that this particular function
> > libxl_domain_setvnuma does not have any users. Instead, xc one is
> > called directly.
> > I want to omit this patch. Do I need to have  #define
> > LIBXL_HAVE_DOMAIN_SETVNUMA 1 in this case?
> > 
> 
> If your macro is specifically for this function then you need to remove
> macro as well. In this case I think the answer is "no you don't need
> that in libxl.h", unless I've mistaken the purpose of your macro.
> 
> Wei.

It was something you (Wei) suggested. :-P :-P :-P

(Cutting and pasting from
http://bugs.xenproject.org/xen/mid/%3C20140718105838.GE7142@zion.uk.xensource.com%3E )
---
>  int libxl_fd_set_cloexec(libxl_ctx *ctx, int fd, int cloexec);
>  int libxl_fd_set_nonblock(libxl_ctx *ctx, int fd, int nonblock);
>  
> +int libxl_domain_setvnuma(libxl_ctx *ctx,
> +                           uint32_t domid,
> +                           uint16_t nr_vnodes,
> +                           uint16_t nr_vcpus,
> +                           vmemrange_t *vmemrange,
> +                           unsigned int *vdistance,
> +                           unsigned int *vcpu_to_vnode,
> +                           unsigned int *vnode_to_pnode);
> +
>  #include <libxl_event.h>
>  

You will need to add
  #define LIBXL_HAVE_DOMAIN_SETVNUMA 1
to advertise the introduction of new API.

Wei.
---

Anyway, I agree: if the function goes, no need to add any macro.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-08-12 17:10         ` Dario Faggioli
@ 2014-08-12 17:13           ` Wei Liu
  2014-08-12 17:24             ` Elena Ufimtseva
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2014-08-12 17:13 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Li Yechen, Ian Jackson, xen-devel, Jan Beulich,
	Elena Ufimtseva, Wei Liu

On Tue, Aug 12, 2014 at 07:10:37PM +0200, Dario Faggioli wrote:
> On mar, 2014-08-12 at 10:42 +0100, Wei Liu wrote:
> > On Mon, Aug 11, 2014 at 11:52:55PM -0400, Elena Ufimtseva wrote:
> 
> > > Preparing the series, I figured that this particular function
> > > libxl_domain_setvnuma does not have any users. Instead, xc one is
> > > called directly.
> > > I want to omit this patch. Do I need to have  #define
> > > LIBXL_HAVE_DOMAIN_SETVNUMA 1 in this case?
> > > 
> > 
> > If your macro is specifically for this function then you need to remove
> > macro as well. In this case I think the answer is "no you don't need
> > that in libxl.h", unless I've mistaken the purpose of your macro.
> > 
> > Wei.
> 
> It was something you (Wei) suggested. :-P :-P :-P
> 

Yes, I saw my previous reply after I sent my email. :-)

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/10] libxl: set vnuma for domain
  2014-08-12 17:13           ` Wei Liu
@ 2014-08-12 17:24             ` Elena Ufimtseva
  0 siblings, 0 replies; 63+ messages in thread
From: Elena Ufimtseva @ 2014-08-12 17:24 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich

On Tue, Aug 12, 2014 at 1:13 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, Aug 12, 2014 at 07:10:37PM +0200, Dario Faggioli wrote:
>> On mar, 2014-08-12 at 10:42 +0100, Wei Liu wrote:
>> > On Mon, Aug 11, 2014 at 11:52:55PM -0400, Elena Ufimtseva wrote:
>>
>> > > Preparing the series, I figured that this particular function
>> > > libxl_domain_setvnuma does not have any users. Instead, xc one is
>> > > called directly.
>> > > I want to omit this patch. Do I need to have  #define
>> > > LIBXL_HAVE_DOMAIN_SETVNUMA 1 in this case?
>> > >
>> >
>> > If your macro is specifically for this function then you need to remove
>> > macro as well. In this case I think the answer is "no you don't need
>> > that in libxl.h", unless I've mistaken the purpose of your macro.
>> >
>> > Wei.
>>
>> It was something you (Wei) suggested. :-P :-P :-P
>>
>
> Yes, I saw my previous reply after I sent my email. :-)

That's fine :) I understood what Wei meant to say.

>
> Wei.



-- 
Elena

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2014-08-12 17:24 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-18  5:49 [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
2014-07-18  5:50 ` [PATCH v6 01/10] xen: vnuma topology and subop hypercalls Elena Ufimtseva
2014-07-18 10:30   ` Wei Liu
2014-07-20 13:16     ` Elena Ufimtseva
2014-07-20 15:59       ` Wei Liu
2014-07-22 15:18         ` Dario Faggioli
2014-07-23  5:33           ` Elena Ufimtseva
2014-07-18 13:49   ` Konrad Rzeszutek Wilk
2014-07-20 13:26     ` Elena Ufimtseva
2014-07-22 15:14   ` Dario Faggioli
2014-07-23  5:22     ` Elena Ufimtseva
2014-07-23 14:06   ` Jan Beulich
2014-07-25  4:52     ` Elena Ufimtseva
2014-07-25  7:33       ` Jan Beulich
2014-07-18  5:50 ` [PATCH v6 02/10] xsm bits for vNUMA hypercalls Elena Ufimtseva
2014-07-18 13:50   ` Konrad Rzeszutek Wilk
2014-07-18 15:26     ` Daniel De Graaf
2014-07-20 13:48       ` Elena Ufimtseva
2014-07-18  5:50 ` [PATCH v6 03/10] vnuma hook to debug-keys u Elena Ufimtseva
2014-07-23 14:10   ` Jan Beulich
2014-07-18  5:50 ` [PATCH v6 04/10] libxc: Introduce xc_domain_setvnuma to set vNUMA Elena Ufimtseva
2014-07-18 10:33   ` Wei Liu
2014-07-29 10:33   ` Ian Campbell
2014-07-18  5:50 ` [PATCH v6 05/10] libxl: vnuma topology configuration parser and doc Elena Ufimtseva
2014-07-18 10:53   ` Wei Liu
2014-07-20 14:04     ` Elena Ufimtseva
2014-07-29 10:38   ` Ian Campbell
2014-07-29 10:42   ` Ian Campbell
2014-08-06  4:46     ` Elena Ufimtseva
2014-07-18  5:50 ` [PATCH v6 06/10] libxc: move code to arch_boot_alloc func Elena Ufimtseva
2014-07-29 10:38   ` Ian Campbell
2014-07-18  5:50 ` [PATCH v6 07/10] libxc: allocate domain memory for vnuma enabled Elena Ufimtseva
2014-07-29 10:43   ` Ian Campbell
2014-08-06  4:48     ` Elena Ufimtseva
2014-07-18  5:50 ` [PATCH v6 08/10] libxl: build numa nodes memory blocks Elena Ufimtseva
2014-07-18 11:01   ` Wei Liu
2014-07-20 12:58     ` Elena Ufimtseva
2014-07-20 15:59       ` Wei Liu
2014-07-18  5:50 ` [PATCH v6 09/10] libxl: vnuma nodes placement bits Elena Ufimtseva
2014-07-18  5:50 ` [PATCH v6 10/10] libxl: set vnuma for domain Elena Ufimtseva
2014-07-18 10:58   ` Wei Liu
2014-07-29 10:45   ` Ian Campbell
2014-08-12  3:52     ` Elena Ufimtseva
2014-08-12  9:42       ` Wei Liu
2014-08-12 17:10         ` Dario Faggioli
2014-08-12 17:13           ` Wei Liu
2014-08-12 17:24             ` Elena Ufimtseva
2014-07-18  6:16 ` [PATCH v6 00/10] vnuma introduction Elena Ufimtseva
2014-07-18  9:53 ` Wei Liu
2014-07-18 10:13   ` Dario Faggioli
2014-07-18 11:48     ` Wei Liu
2014-07-20 14:57       ` Elena Ufimtseva
2014-07-22 15:49         ` Dario Faggioli
2014-07-22 14:03       ` Dario Faggioli
2014-07-22 14:48         ` Wei Liu
2014-07-22 15:06           ` Dario Faggioli
2014-07-22 16:47             ` Wei Liu
2014-07-22 19:43         ` Is: cpuid creation of PV guests is not correct. Was:Re: " Konrad Rzeszutek Wilk
2014-07-22 22:34           ` Is: cpuid creation of PV guests is not correct Andrew Cooper
2014-07-22 22:53           ` Is: cpuid creation of PV guests is not correct. Was:Re: [PATCH v6 00/10] vnuma introduction Dario Faggioli
2014-07-23  6:00             ` Elena Ufimtseva
2014-07-22 12:49 ` Dario Faggioli
2014-07-23  5:59   ` Elena Ufimtseva
