* [PATCH v5 0/8] vnuma introduction
@ 2014-06-03  4:53 Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info for debug-key Elena Ufimtseva
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

v5 of the patchset is mostly intended to revive the conversation about vnuma
and to state the intention of having it included in Xen 4.5 (along with dom0).
The libxl part will be modified to align with Wei Liu's work.
The vnuma placement mechanism is still subject to discussion.
Your comments are welcome.

vNUMA introduction

This series of patches introduces vNUMA topology awareness and
provides the interfaces and data structures to enable vNUMA for
PV guests. There is a plan to extend this support to dom0 and
HVM domains.

vNUMA topology support must also be present in the PV guest kernel;
the corresponding Linux patches should be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines.
The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on
NUMA/UMA machines and to map the vNUMA topology to physical NUMA in an optimal way.

Xen vNUMA support

The current set of patches introduces a subop hypercall that is available to
enlightened PV guests with the vNUMA patches applied.
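
As an illustration (not part of this series; the Linux-style hypercall wrappers,
fixed-size buffers and header locations below are assumptions), a PV guest might
retrieve its topology via the new XENMEM_get_vnuma_info subop roughly like this:

/* Hypothetical guest-side sketch -- not part of this patchset. */
#include <xen/interface/memory.h>   /* XENMEM_get_vnuma_info (assumed location) */
#include <xen/interface/vnuma.h>    /* vnuma_topology_info   (assumed location) */

#define GUEST_MAX_VNODES 8
#define GUEST_MAX_VCPUS  8

static unsigned int vdistance[GUEST_MAX_VNODES * GUEST_MAX_VNODES];
static unsigned int vcpu_to_vnode[GUEST_MAX_VCPUS];
static struct vmemrange vmemrange[GUEST_MAX_VNODES];

static int xen_get_vnuma_topology(void)
{
    struct vnuma_topology_info topo = {
        .domid     = DOMID_SELF,
        /* Tell Xen how much room the guest allocated for the answer. */
        .nr_vnodes = GUEST_MAX_VNODES,
        .nr_vcpus  = GUEST_MAX_VCPUS,
    };

    set_xen_guest_handle(topo.vdistance.h, vdistance);
    set_xen_guest_handle(topo.vcpu_to_vnode.h, vcpu_to_vnode);
    set_xen_guest_handle(topo.vmemrange.h, vmemrange);

    /* Returns 0 on success, -EOPNOTSUPP if Xen has no vNUMA topology for us. */
    return HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &topo);
}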

The domain structure was modified to hold the per-domain vNUMA topology for use
by other vNUMA-aware subsystems (e.g. ballooning).
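
For reference, the per-domain structure added by patch 1 (xen/include/xen/domain.h)
looks like this:

struct vnuma_info {
    unsigned int nr_vnodes;
    unsigned int *vdistance;
    unsigned int *vcpu_to_vnode;
    unsigned int *vnode_to_pnode;
    struct vmemrange *vmemrange;
};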

libxc

libxc provides interfaces to build PV guests with vNUMA support and, on NUMA
machines, performs the initial memory allocation on physical NUMA nodes. This is
implemented by utilizing the nodemap formed by automatic NUMA placement. Details are in patch #3.
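
As a rough usage sketch (not taken from the series; the domid, node sizes and
mappings below are illustrative and error handling is omitted), a toolstack
caller of the new libxc interface from patch 2 might look like this:

#include <xenctrl.h>

/* Describe 2 vnodes and 4 vcpus for an already created domain. */
static int set_vnuma_example(xc_interface *xch, uint32_t domid)
{
    vmemrange_t vmemrange[2] = {
        { .start = 0,             .end = 2048ULL << 20 },  /* vnode 0: 0 - 2GB   */
        { .start = 2048ULL << 20, .end = 4096ULL << 20 },  /* vnode 1: 2GB - 4GB */
    };
    unsigned int vdistance[4]      = { 10, 20,
                                       20, 10 };           /* 2x2 distance table */
    unsigned int vcpu_to_vnode[4]  = { 0, 0, 1, 1 };
    unsigned int vnode_to_pnode[2] = { 0, 1 };

    return xc_domain_setvnuma(xch, domid, 2 /* nr_vnodes */, 4 /* nr_vcpus */,
                              vmemrange, vdistance, vcpu_to_vnode, vnode_to_pnode);
}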

libxl

libxl provides a way to predefine the vNUMA topology in the VM config: number of
vnodes, memory arrangement, vcpu-to-vnode assignment, and the distance map.

PV guest

As of now, only PV guests can take advantage of the vNUMA functionality. The vNUMA
Linux patches must be applied and NUMA support must be compiled into the kernel.
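
For example, a guest kernel configuration along these lines is assumed (the exact
set of options beyond CONFIG_NUMA may vary with the vNUMA patch revision):

# fragment of the PV guest .config - illustrative only
CONFIG_XEN=y
CONFIG_NUMA=y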

Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on h/w NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN)     Node 0: 1416166
(XEN)     Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN)     Node 0: 2097152
(XEN)     Node 1: 2097152
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 0  (4096) MB
(XEN)         vnode 1 - pnode 0  (4096) MB
(XEN)         vnode 2 - pnode 1  (4096) MB
(XEN)         vnode 3 - pnode 1  (4096) MB
(XEN)     Domain vcpu to vnode:
(XEN)     0 1 2 3

dmesg on pv guest:

[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xffffffff]
[    0.000000]   node   1: [mem 0x100000000-0x1ffffffff]
[    0.000000]   node   2: [mem 0x200000000-0x2ffffffff]
[    0.000000]   node   3: [mem 0x300000000-0x3ffffffff]
[    0.000000] On node 0 totalpages: 1048479
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 14280 pages used for memmap
[    0.000000]   DMA32 zone: 1044480 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] On node 3 totalpages: 1048576
[    0.000000]   Normal zone: 14336 pages used for memmap
[    0.000000]   Normal zone: 1048576 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] No local APIC present
[    0.000000] APIC: disable apic facility
[    0.000000] APIC: switched to apic NOOP
[    0.000000] nr_irqs_gsi: 16
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.000000] e820: cannot find a gap in the 32bit address range
[    0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[    0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4-unstable (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[    0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[    0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3


pv guest: numactl --hardware:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Comments:
None of the configuration options above are correct, so default values were used.
Since the machine is a NUMA machine and no vcpu pinning was defined, the automatic
NUMA node selection mechanism was used, and you can see how the vnodes were split
across the physical nodes.

2. Example with e820_host = 1 (32GB real NUMA machine, two nodes).

pv config:
memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null"
vnodes = 4
#vnumamem = [3000, 1000]
vdistance = [10, 40]
#vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 0]
#vnuma_autoplacement = 1
e820_host = 1

guest boot:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.12.0+ (assert@superpipe) (gcc version 4.7.2 (Debi
an 4.7.2-5) ) #111 SMP Tue Dec 3 14:54:36 EST 2013
[    0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8
 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintk=
xen sched_debug
[    0.000000] ACPI in unprivileged domain disabled
[    0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed
[    0.000000] 1-1 mapping on ac228->100000
[    0.000000] Released 318936 pages of unused memory
[    0.000000] Set 343512 page(s) to 1-1 mapping
[    0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable
[    0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved
[    0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable
[    0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved
[    0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved
[    0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable
[    0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable
[    0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved
[    0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable
[    0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS
[    0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable
[    0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable
[    0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data
[    0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable
[    0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved
[    0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved
[    0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved
[    0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] No AGP bridge found
[    0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000
[    0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff]
[    0.000000]  [mem 0x14da00000-0x14dbfffff] page 4k
[    0.000000] BRK [0x019bd000, 0x019bdfff] PGTABLE
[    0.000000] BRK [0x019be000, 0x019befff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff]
[    0.000000]  [mem 0x14c000000-0x14d9fffff] page 4k
[    0.000000] BRK [0x019bf000, 0x019bffff] PGTABLE
[    0.000000] BRK [0x019c0000, 0x019c0fff] PGTABLE
[    0.000000] BRK [0x019c1000, 0x019c1fff] PGTABLE
[    0.000000] BRK [0x019c2000, 0x019c2fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff]
[    0.000000]  [mem 0x100000000-0x14bffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff]
[    0.000000]  [mem 0x00100000-0xac227fff] page 4k
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[    0.000000] NUMA: Initialized distance table, cnt=4
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x3e7fffff]
[    0.000000]   NODE_DATA [mem 0x3e7d9000-0x3e7fffff]
[    0.000000] Initmem setup node 1 [mem 0x3e800000-0x7cffffff]
[    0.000000]   NODE_DATA [mem 0x7cfd9000-0x7cffffff]
[    0.000000] Initmem setup node 2 [mem 0x7d000000-0x10f5dffff]
[    0.000000]   NODE_DATA [mem 0x10f5b9000-0x10f5dffff]
[    0.000000] Initmem setup node 3 [mem 0x10f800000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x3e7fffff]
[    0.000000]   node   1: [mem 0x3e800000-0x7cffffff]
[    0.000000]   node   2: [mem 0x7d000000-0xac227fff]
[    0.000000]   node   2: [mem 0x100000000-0x10f5dffff]
[    0.000000]   node   3: [mem 0x10f5e0000-0x14ddd7fff]
[    0.000000] On node 0 totalpages: 255903
[    0.000000]   DMA zone: 56 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3999 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 3444 pages used for memmap
[    0.000000]   DMA32 zone: 251904 pages, LIFO batch:31
[    0.000000] On node 1 totalpages: 256000
[    0.000000]   DMA32 zone: 3500 pages used for memmap
[    0.000000]   DMA32 zone: 256000 pages, LIFO batch:31
[    0.000000] On node 2 totalpages: 256008
[    0.000000]   DMA32 zone: 2640 pages used for memmap
[    0.000000]   DMA32 zone: 193064 pages, LIFO batch:31
[    0.000000]   Normal zone: 861 pages used for memmap
[    0.000000]   Normal zone: 62944 pages, LIFO batch:15
[    0.000000] On node 3 totalpages: 255992
[    0.000000]   Normal zone: 3500 pages used for memmap
[    0.000000]   Normal zone: 255992 pages, LIFO batch:31
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs

root@heatpipe:~# numactl --ha
available: 4 nodes (0-3)
node 0 cpus: 0 4
node 0 size: 977 MB
node 0 free: 947 MB
node 1 cpus: 1 5
node 1 size: 985 MB
node 1 free: 974 MB
node 2 cpus: 2 6
node 2 size: 985 MB
node 2 free: 973 MB
node 3 cpus: 3 7
node 3 size: 969 MB
node 3 free: 958 MB
node distances:
node   0   1   2   3
  0:  10  40  40  40
  1:  40  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

root@heatpipe:~# numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2          Node 3           Total
                 --------------- --------------- --------------- --------------- ---------------
MemTotal                  977.14          985.50          985.44          969.91         3917.99

hypervisor: xl debug-keys u

(XEN) 'u' pressed -> dumping numa info (now-0x2A3:F7B8CB0F)
(XEN) Domain 2 (total: 1024000):
(XEN)     Node 0: 415468
(XEN)     Node 1: 608532
(XEN)     Domain has 4 vnodes
(XEN)         vnode 0 - pnode 1 1000 MB, vcpus: 0 4
(XEN)         vnode 1 - pnode 0 1000 MB, vcpus: 1 5
(XEN)         vnode 2 - pnode 1 2341 MB, vcpus: 2 6
(XEN)         vnode 3 - pnode 0 999 MB, vcpus: 3 7

This size discrepancy is caused by the way the size is calculated from the
guest pfn range: end - start. Thus the hole, ~1.3GB in this case, is
included in the size.
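
For example, for vnode 2 above, taking the guest memory map printed earlier as
the assumed layout:

  size = end - start = 0x10f5e0000 - 0x7d000000  ~ 2341 MB

but the usable RAM inside that range is only

  (0xac228000 - 0x7d000000) + (0x10f5e0000 - 0x100000000)  ~ 754 MB + 246 MB ~ 1000 MB

The difference (~1.3GB) is the e820 hole 0xac228000 - 0x100000000 that
e820_host=1 reproduces in the guest address space.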

3. Zero vNUMA configuration for every PV domain.
There will be at least one vnuma node even if no vnuma topology was
specified.

pv config:

memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null"
#vnodes = 4
vnumamem = [3000, 1000]
vdistance = [10, 40]
vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 0]
vnuma_autoplacement = 1
e820_host = 1

boot:
[    0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[    0.000000]  [mem 0x14dc00000-0x14ddd7fff] page 4k
[    0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[    0.000000] NUMA: Initialized distance table, cnt=1
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x14ddd7fff]
[    0.000000]   NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x14ddd7fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0xac227fff]
[    0.000000]   node   0: [mem 0x100000000-0x14ddd7fff]

root@heatpipe:~# numactl --ha
maxn: 0
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 3918 MB
node 0 free: 3853 MB
node distances:
node   0
  0:  10

root@heatpipe:~# numastat -m

Per-node system memory usage (in MBs):
                          Node 0           Total
                 --------------- ---------------
MemTotal                 3918.74         3918.74

hypervisor: xl debug-keys u

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 6787432):
(XEN)     Node 0: 3485706
(XEN)     Node 1: 3301726
(XEN) Domain 3 (total: 1024000):
(XEN)     Node 0: 512000
(XEN)     Node 1: 512000
(XEN)     Domain has 1 vnodes
(XEN)         vnode 0 - pnode any 5341 MB, vcpus: 0 1 2 3 4 5 6 7

Patchsets for Xen and Linux:

Linux patchset is available at:
git://gitorious.org/xenvnuma_v5/linuxvnuma_v5.git
https://git.gitorious.org/xenvnuma_v5/linuxvnuma_v5.git

Xen patchset is available at:
git://gitorious.org/xenvnuma_v5/xenvnuma_v5.git
https://git.gitorious.org/xenvnuma_v5/xenvnuma_v5.git

Issues:

An issue with automatic NUMA placement was found and resolved.
A new issue has arisen with a recursive spinlock when changing NUMA
memory protection; this is currently being investigated.

Elena Ufimtseva (1):
  add vnuma info for debug-key

 xen/arch/x86/numa.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

-- 
1.7.10.4


* [PATCH v5 8/8] add vnuma info for debug-key
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  9:04   ` Jan Beulich
  2014-06-03  4:53 ` [PATCH v5 1/8] xen: vnuma topoplogy and subop hypercalls Elena Ufimtseva
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/arch/x86/numa.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index b141877..8310b03 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -347,9 +347,10 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
 	s_time_t now = NOW();
-	int i;
+	int i, j, n, err;
 	struct domain *d;
 	struct page_info *page;
+	char tmp[12];
 	unsigned int page_num_node[MAX_NUMNODES];
 
 	printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
@@ -389,6 +390,32 @@ static void dump_numa(unsigned char key)
 
 		for_each_online_node(i)
 			printk("    Node %u: %u\n", i, page_num_node[i]);
+
+		printk("    Domain has %u vnodes\n", d->vnuma.nr_vnodes);
+		for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) {
+			err = snprintf(tmp, 12, "%u", d->vnuma.vnode_to_pnode[i]);
+			if ( err < 0 )
+				printk("        vnode %u - pnode %s,", i, "any");
+			else
+				printk("        vnode %u - pnode %s,", i,
+			d->vnuma.vnode_to_pnode[i] == NUMA_NO_NODE ? "any" : tmp);
+			printk(" %"PRIu64" MB, ",
+				(d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20);
+			printk("vcpus: ");
+
+			for ( j = 0, n = 0; j < d->max_vcpus; j++ ) {
+				if ( d->vnuma.vcpu_to_vnode[j] == i ) {
+					if ( !((n + 1) % 8) )
+						printk("%u\n", j);
+					else if ( !(n % 8) && n != 0 )
+							printk("%s%u ", "             ", j);
+						else
+							printk("%u ", j);
+					n++;
+				}
+			}
+			printk("\n");
+		}
 	}
 
 	rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4


* [PATCH v5 1/8] xen: vnuma topoplogy and subop hypercalls
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info for debug-key Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  8:55   ` Jan Beulich
  2014-06-03  4:53 ` [PATCH v5 2/8] libxc: Plumb Xen with vnuma topology Elena Ufimtseva
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Defines the interface, structures and hypercalls for the toolstack to
build the vnuma topology and for guests that wish to retrieve it.
Two subop hypercalls are introduced by this patch:
XEN_DOMCTL_setvnumainfo to define the vNUMA topology per domain,
and XENMEM_get_vnuma_info for the guest to retrieve that topology.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---

Changes since v4:
- added check to make sure guest has enough memory for vnuma
topology;
- code style fixes;

Changes since v3:
- added a subop hypercall to retrieve the number of vnodes
and vcpus for a domain, to make correct allocations before
requesting the vnuma topology.
---
 xen/common/domain.c         |   26 ++++++++++++++
 xen/common/domctl.c         |   84 +++++++++++++++++++++++++++++++++++++++++++
 xen/common/memory.c         |   67 ++++++++++++++++++++++++++++++++++
 xen/include/public/domctl.h |   28 +++++++++++++++
 xen/include/public/memory.h |   14 ++++++++
 xen/include/public/vnuma.h  |   54 ++++++++++++++++++++++++++++
 xen/include/xen/domain.h    |   11 ++++++
 xen/include/xen/sched.h     |    1 +
 8 files changed, 285 insertions(+)
 create mode 100644 xen/include/public/vnuma.h

diff --git a/xen/common/domain.c b/xen/common/domain.c
index bc57174..5b7ce17 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -567,6 +567,15 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
     return 0;
 }
 
+static void vnuma_destroy(struct vnuma_info *vnuma)
+{
+    vnuma->nr_vnodes = 0;
+    xfree(vnuma->vmemrange);
+    xfree(vnuma->vcpu_to_vnode);
+    xfree(vnuma->vdistance);
+    xfree(vnuma->vnode_to_pnode);
+}
+
 int domain_kill(struct domain *d)
 {
     int rc = 0;
@@ -585,6 +594,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem_client);
+        vnuma_destroy(&d->vnuma);
         domain_set_outstanding_pages(d, 0);
         d->tmem_client = NULL;
         /* fallthrough */
@@ -1350,6 +1360,22 @@ int continue_hypercall_on_cpu(
 }
 
 /*
+ * Changes a previously set domain vnuma topology to the default one
+ * that has one node and all other default values. Since the domain
+ * memory may be at this point allocated on multiple HW NUMA nodes,
+ * NUMA_NO_NODE is set for vnode to pnode mask.
+ */
+int vnuma_init_zero_topology(struct domain *d)
+{
+    d->vnuma.vmemrange[0].end = d->vnuma.vmemrange[d->vnuma.nr_vnodes - 1].end;
+    d->vnuma.vdistance[0] = 10;
+    memset(d->vnuma.vnode_to_pnode, NUMA_NO_NODE, d->vnuma.nr_vnodes);
+    memset(d->vnuma.vcpu_to_vnode, 0, d->max_vcpus);
+    d->vnuma.nr_vnodes = 1;
+    return 0;
+}
+
+/*
  * Local variables:
  * mode: C
  * c-file-style: "BSD"
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 4774277..66fdcee 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -29,6 +29,7 @@
 #include <asm/page.h>
 #include <public/domctl.h>
 #include <xsm/xsm.h>
+#include <public/vnuma.h>
 
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);
@@ -888,6 +889,89 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
     }
     break;
 
+    case XEN_DOMCTL_setvnumainfo:
+    {
+        unsigned int dist_size, nr_vnodes;
+
+        ret = -EINVAL;
+
+        /* If number of vnodes was set before, skip */
+        if ( d->vnuma.nr_vnodes > 0 )
+            break;
+
+        nr_vnodes = op->u.vnuma.nr_vnodes;
+        if ( nr_vnodes == 0 )
+            goto setvnumainfo_out;
+
+        if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
+            goto setvnumainfo_out;
+
+        ret = -EFAULT;
+        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
+             guest_handle_is_null(op->u.vnuma.vmemrange)     ||
+             guest_handle_is_null(op->u.vnuma.vcpu_to_vnode) ||
+             guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
+            goto setvnumainfo_out;
+
+        dist_size = nr_vnodes * nr_vnodes;
+
+        d->vnuma.vdistance = xmalloc_array(unsigned int, dist_size);
+        d->vnuma.vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
+        d->vnuma.vcpu_to_vnode = xmalloc_array(unsigned int, d->max_vcpus);
+        d->vnuma.vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
+
+        if ( d->vnuma.vdistance == NULL ||
+             d->vnuma.vmemrange == NULL ||
+             d->vnuma.vcpu_to_vnode == NULL ||
+             d->vnuma.vnode_to_pnode == NULL )
+        {
+            ret = -ENOMEM;
+            goto setvnumainfo_nomem;
+        }
+
+        if ( unlikely(__copy_from_guest(d->vnuma.vdistance,
+                                    op->u.vnuma.vdistance,
+                                    dist_size)) )
+            goto setvnumainfo_out;
+        if ( unlikely(__copy_from_guest(d->vnuma.vmemrange,
+                                    op->u.vnuma.vmemrange,
+                                    nr_vnodes)) )
+            goto setvnumainfo_out;
+        if ( unlikely(__copy_from_guest(d->vnuma.vcpu_to_vnode,
+                                    op->u.vnuma.vcpu_to_vnode,
+                                    d->max_vcpus)) )
+            goto setvnumainfo_out;
+        if ( unlikely(__copy_from_guest(d->vnuma.vnode_to_pnode,
+                                    op->u.vnuma.vnode_to_pnode,
+                                    nr_vnodes)) )
+            goto setvnumainfo_out;
+
+        /* Everything is good, lets set the number of vnodes */
+        d->vnuma.nr_vnodes = nr_vnodes;
+
+        ret = 0;
+        break;
+
+ setvnumainfo_out:
+        /* On failure, set one vNUMA node */
+        d->vnuma.vmemrange[0].end = d->vnuma.vmemrange[d->vnuma.nr_vnodes - 1].end;
+        d->vnuma.vdistance[0] = 10;
+        memset(d->vnuma.vnode_to_pnode, NUMA_NO_NODE, d->vnuma.nr_vnodes);
+        memset(d->vnuma.vcpu_to_vnode, 0, d->max_vcpus);
+        d->vnuma.nr_vnodes = 1;
+        ret = 0;
+        break;
+
+ setvnumainfo_nomem:
+        /* The only case where we set number of vnodes to 0 */
+        d->vnuma.nr_vnodes = 0;
+        xfree(d->vnuma.vmemrange);
+        xfree(d->vnuma.vdistance);
+        xfree(d->vnuma.vnode_to_pnode);
+        xfree(d->vnuma.vcpu_to_vnode);
+    }
+    break;
+
     default:
         ret = arch_do_domctl(op, d, u_domctl);
         break;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 257f4b0..2067f42 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -963,6 +963,73 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         break;
 
+    case XENMEM_get_vnuma_info:
+    {
+        struct vnuma_topology_info guest_topo;
+        struct domain *d;
+
+        if ( copy_from_guest(&guest_topo, arg, 1) )
+            return -EFAULT;
+        if ( (d = rcu_lock_domain_by_any_id(guest_topo.domid)) == NULL )
+            return -ESRCH;
+
+        if ( d->vnuma.nr_vnodes == 0 ) {
+            rc = -EOPNOTSUPP;
+            goto vnumainfo_out;
+        }
+
+        rc = -EOPNOTSUPP;
+        /*
+         * The guest may have a different kernel configuration for the
+         * number of cpus/nodes; it informs Xen about them via the hypercall.
+         */
+        if ( guest_topo.nr_vnodes < d->vnuma.nr_vnodes ||
+            guest_topo.nr_vcpus < d->max_vcpus )
+            goto vnumainfo_out;
+
+        rc = -EFAULT;
+
+        if ( guest_handle_is_null(guest_topo.vmemrange.h)    ||
+             guest_handle_is_null(guest_topo.vdistance.h)    ||
+             guest_handle_is_null(guest_topo.vcpu_to_vnode.h) )
+            goto vnumainfo_out;
+
+        /*
+         * Take a failure path if out of guest allocated memory for topology.
+         * No partial copying.
+         */
+        guest_topo.nr_vnodes = d->vnuma.nr_vnodes;
+
+        if ( __copy_to_guest(guest_topo.vmemrange.h,
+                                d->vnuma.vmemrange,
+                                d->vnuma.nr_vnodes) != 0 )
+            goto vnumainfo_out;
+
+        if ( __copy_to_guest(guest_topo.vdistance.h,
+                                d->vnuma.vdistance,
+                                d->vnuma.nr_vnodes * d->vnuma.nr_vnodes) != 0 )
+            goto vnumainfo_out;
+
+        if ( __copy_to_guest(guest_topo.vcpu_to_vnode.h,
+                                d->vnuma.vcpu_to_vnode,
+                                d->max_vcpus) != 0 )
+            goto vnumainfo_out;
+
+        rc = 0;
+
+ vnumainfo_out:
+        if ( rc != 0 )
+            /*
+             * In case of failure to provide the vNUMA topology to the guest,
+             * leave everything as it is and only print an error. The tools will
+             * still show the domain's vNUMA topology, but the guest won't see it.
+             */
+            gdprintk(XENLOG_INFO, "vNUMA: failed to copy topology info to guest.\n");
+
+        rcu_unlock_domain(d);
+        break;
+    }
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 565fa4c..8b65a75 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -35,6 +35,7 @@
 #include "xen.h"
 #include "grant_table.h"
 #include "hvm/save.h"
+#include "vnuma.h"
 
 #define XEN_DOMCTL_INTERFACE_VERSION 0x0000000a
 
@@ -895,6 +896,31 @@ struct xen_domctl_cacheflush {
 typedef struct xen_domctl_cacheflush xen_domctl_cacheflush_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_cacheflush_t);
 
+/*
+ * XEN_DOMCTL_setvnumainfo: sets the vNUMA topology
+ * parameters for domain from toolstack.
+ */
+struct xen_domctl_vnuma {
+    uint32_t nr_vnodes;
+    uint32_t __pad;
+    XEN_GUEST_HANDLE_64(uint) vdistance;
+    XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode;
+
+    /*
+     * vnode-to-physical-NUMA-node mask.
+     * This is kept on a per-domain basis for
+     * interested consumers, such as NUMA-aware ballooning.
+     */
+    XEN_GUEST_HANDLE_64(uint) vnode_to_pnode;
+
+    /*
+     * memory ranges for each vNUMA node
+     */
+    XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange;
+};
+typedef struct xen_domctl_vnuma xen_domctl_vnuma_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t);
+
 struct xen_domctl {
     uint32_t cmd;
 #define XEN_DOMCTL_createdomain                   1
@@ -965,6 +991,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_getnodeaffinity               69
 #define XEN_DOMCTL_set_max_evtchn                70
 #define XEN_DOMCTL_cacheflush                    71
+#define XEN_DOMCTL_setvnumainfo                  72
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1024,6 +1051,7 @@ struct xen_domctl {
         struct xen_domctl_cacheflush        cacheflush;
         struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu;
         struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
+        struct xen_domctl_vnuma             vnuma;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 2c57aa0..a7dc035 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -354,6 +354,20 @@ struct xen_pod_target {
 };
 typedef struct xen_pod_target xen_pod_target_t;
 
+/*
+ * XENMEM_get_vnuma_info used by caller to get
+ * vNUMA topology constructed for particular domain.
+ *
+ * The data exchanged is presented by vnuma_topology_info.
+ */
+#define XENMEM_get_vnuma_info               26
+
+/*
+ * XENMEM_get_vnuma_pnode used by guest to determine
+ * the physical node of the specified vnode.
+ */
+/*#define XENMEM_get_vnuma_pnode              27*/
+
 #if defined(__XEN__) || defined(__XEN_TOOLS__)
 
 #ifndef uint64_aligned_t
diff --git a/xen/include/public/vnuma.h b/xen/include/public/vnuma.h
new file mode 100644
index 0000000..ab9eda0
--- /dev/null
+++ b/xen/include/public/vnuma.h
@@ -0,0 +1,54 @@
+#ifndef _XEN_PUBLIC_VNUMA_H
+#define _XEN_PUBLIC_VNUMA_H
+
+#include "xen.h"
+
+/*
+ * Following structures are used to represent vNUMA
+ * topology to guest if requested.
+ */
+
+/*
+ * Memory ranges can be used to define
+ * vNUMA memory node boundaries by the
+ * linked list. As of now, only one range
+ * per domain is supported.
+ */
+struct vmemrange {
+    uint64_t start, end;
+};
+
+typedef struct vmemrange vmemrange_t;
+DEFINE_XEN_GUEST_HANDLE(vmemrange_t);
+
+/*
+ * The vNUMA topology specifies the number of vNUMA nodes, the distance table,
+ * memory ranges and the vcpu mapping provided to guests.
+ * When issuing the hypercall, the guest is expected to inform Xen about the
+ * memory it allocated for the vnuma structures through nr_vnodes and nr_vcpus.
+ */
+
+struct vnuma_topology_info {
+    /* IN */
+    domid_t domid;
+    /* IN/OUT */
+    unsigned int nr_vnodes;
+    unsigned int nr_vcpus;
+    /* OUT */
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t    _pad;
+    } vdistance;
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t    _pad;
+    } vcpu_to_vnode;
+    union {
+        XEN_GUEST_HANDLE(vmemrange_t) h;
+        uint64_t    _pad;
+    } vmemrange;
+};
+typedef struct vnuma_topology_info vnuma_topology_info_t;
+DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
+
+#endif
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index bb1c398..e8b36e3 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -89,4 +89,15 @@ extern unsigned int xen_processor_pmbits;
 
 extern bool_t opt_dom0_vcpus_pin;
 
+/* vnuma_info struct to manage by Xen */
+struct vnuma_info {
+    unsigned int nr_vnodes;
+    unsigned int *vdistance;
+    unsigned int *vcpu_to_vnode;
+    unsigned int *vnode_to_pnode;
+    struct vmemrange *vmemrange;
+};
+
+int vnuma_init_zero_topology(struct domain *d);
+
 #endif /* __XEN_DOMAIN_H__ */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 44851ae..a1163fd 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,7 @@ struct domain
     nodemask_t node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
+    struct vnuma_info vnuma;
 };
 
 struct domain_setup_info
-- 
1.7.10.4


* [PATCH v5 2/8] libxc: Plumb Xen with vnuma topology
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info for debug-key Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 1/8] xen: vnuma topoplogy and subop hypercalls Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 3/8] vnuma xl.cfg.pod and idl config options Elena Ufimtseva
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Per-domain vNUMA topology initialization.
The domctl hypercall is used to set the vNUMA topology
per domU at domain build time.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/xc_domain.c |   64 +++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h   |   11 ++++++++
 2 files changed, 75 insertions(+)

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 369c3f3..385086a 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1814,6 +1814,70 @@ int xc_domain_set_max_evtchn(xc_interface *xch, uint32_t domid,
     return do_domctl(xch, &domctl);
 }
 
+/* Plumbs Xen with vNUMA topology */
+int xc_domain_setvnuma(xc_interface *xch,
+                        uint32_t domid,
+                        uint16_t nr_vnodes,
+                        uint16_t nr_vcpus,
+                        vmemrange_t *vmemrange,
+                        unsigned int *vdistance,
+                        unsigned int *vcpu_to_vnode,
+                        unsigned int *vnode_to_pnode)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
+                                    nr_vnodes * nr_vnodes,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) *
+                                    nr_vnodes,
+                                    XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    if ( nr_vnodes == 0 ) {
+        PERROR("ZERO.\n");
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ( !vdistance || !vcpu_to_vnode || !vmemrange || !vnode_to_pnode ) {
+        PERROR("Incorrect parameters for XEN_DOMCTL_setvnumainfo.\n");
+        errno = EINVAL;
+        return -1;
+    }
+
+    if ( xc_hypercall_bounce_pre(xch, vmemrange)      ||
+         xc_hypercall_bounce_pre(xch, vdistance)      ||
+         xc_hypercall_bounce_pre(xch, vcpu_to_vnode)  ||
+         xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) {
+        PERROR("Could not bounce buffer for xc_domain_setvnuma.\n");
+        errno = EFAULT;
+        return -1;
+    }
+
+    set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange);
+    set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance);
+    set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode);
+    set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode);
+
+    domctl.cmd = XEN_DOMCTL_setvnumainfo;
+    domctl.domain = (domid_t)domid;
+    domctl.u.vnuma.nr_vnodes = nr_vnodes;
+
+    rc = do_domctl(xch, &domctl);
+
+    xc_hypercall_bounce_post(xch, vmemrange);
+    xc_hypercall_bounce_post(xch, vdistance);
+    xc_hypercall_bounce_post(xch, vcpu_to_vnode);
+    xc_hypercall_bounce_post(xch, vnode_to_pnode);
+
+    if ( rc )
+        errno = EFAULT;
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 02129f7..27a42df 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -47,6 +47,8 @@
 #include <xen/xsm/flask_op.h>
 #include <xen/tmem.h>
 #include <xen/kexec.h>
+#include <xen/vnuma.h>
+
 
 #include "xentoollog.h"
 
@@ -1166,6 +1168,15 @@ int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
                                unsigned long map_limitkb);
 
+int xc_domain_setvnuma(xc_interface *xch,
+                        uint32_t domid,
+                        uint16_t nr_vnodes,
+                        uint16_t nr_vcpus,
+                        vmemrange_t *vmemrange,
+                        unsigned int *vdistance,
+                        unsigned int *vcpu_to_vnode,
+                        unsigned int *vnode_to_pnode);
+
 #if defined(__i386__) || defined(__x86_64__)
 /*
  * PC BIOS standard E820 types and structure.
-- 
1.7.10.4


* [PATCH v5 3/8] vnuma xl.cfg.pod and idl config options
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (2 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 2/8] libxc: Plumb Xen with vnuma topology Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 4/8] vnuma topology parsing routines Elena Ufimtseva
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 docs/man/xl.cfg.pod.5       |   64 ++++++++++++++++++++++++++++++++++++++++++-
 tools/libxl/libxl_types.idl |    6 +++-
 2 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index a94d037..cf98c2b 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -242,6 +242,66 @@ if the values of B<memory=> and B<maxmem=> differ.
 A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver
 it will crash.
 
+=item B<vnuma_nodes=N>
+
+Number of vNUMA nodes the guest will be initialized with on boot.
+
+=item B<vnuma_mem=[vmem1, vmem2, ...]>
+
+The vnode memory sizes, defined in MBytes. If the sum of all vnode memories
+does not match the domain memory, or not all nodes are defined here, domain
+creation will fail. If not specified, memory will be split equally between
+vnodes. Currently the minimum vnode size is 64MB.
+
+Example: vnuma_mem=[1024, 1024, 2048, 2048]
+
+=item B<vdistance=[d1, d2]>
+
+Defines the distance table for vNUMA nodes. Distances for NUMA machines are
+usually represented by a two-dimensional array, and all distances may be
+specified here in one line, by rows. A distance can be specified as two numbers
+[d1, d2], where d1 is the same-node distance and d2 is the value for all other
+distances. If not specified, the default distance is used, e.g. [10, 20].
+
+Examples:
+vnodes = 3
+vdistance=[10, 20]
+will expand to this distance table (this is default setting as well):
+[10, 20, 20]
+[20, 10, 20]
+[20, 20, 10]
+
+=item B<vnuma_vcpumap=[vcpu1, vcpu2, ...]>
+
+Defines the vcpu-to-vnode mapping as a list of integers representing node
+numbers. If not defined, the vcpus are interleaved over the virtual nodes.
+Current limitation: every vNUMA node has to have at least one vcpu, otherwise
+the default vcpu_to_vnode mapping will be used.
+
+Example:
+to map 4 vcpus to 2 nodes - 0,1 vcpu -> vnode1, 2,3 vcpu -> vnode2:
+vnuma_vcpumap = [0, 0, 1, 1]
+
+=item B<vnuma_vnodemap=[p1, p2, ..., pn]>
+
+vnode-to-pnode mapping. Can be configured if manual vnode allocation is
+required. Only takes effect on real NUMA machines, and only if memory or
+other constraints do not prevent it. If the mapping is valid, automatic
+NUMA placement is disabled. If the mapping is incorrect and
+vnuma_autoplacement is true, automatic NUMA placement will be used;
+otherwise domain creation fails.
+
+Example:
+assuming a two-node NUMA machine:
+vnuma_vnodemap=[1, 0]
+the first vnode will be placed on node 1, the second on node 0.
+
+=item B<vnuma_autoplacement=[0|1]>
+
+If enabled, the best physical node placement candidate will automatically be
+found for each vnode when vnuma_vnodemap is incorrect or memory requirements
+prevent using it. Set to '0' by default.
+
 =back
 
 =head3 Event Actions
@@ -620,6 +680,7 @@ must be given in hexadecimal.
 It is recommended to use this option only for trusted VMs under
 administrator control.
 
+
 =item B<irqs=[ NUMBER, NUMBER, ... ]>
 
 Allow a guest to access specific physical IRQs.
@@ -701,7 +762,6 @@ it is safe to allow this to be enabled but you may wish to disable it
 anyway.
 
 =item B<pvh=BOOLEAN>
-
 Selects whether to run this PV guest in an HVM container. Default is 0.
 
 =back
@@ -944,6 +1004,7 @@ preceded by a 32b integer indicating the size of the next structure.
 
 =item B<tsc_mode="MODE">
 
+
 Specifies how the TSC (Time Stamp Counter) should be provided to the
 guest (X86 only). Specifying this option as a number is
 deprecated. Options are:
@@ -989,6 +1050,7 @@ i.e. set to UTC.
 
 Set the real time clock offset in seconds. False (0) by default.
 
+
 =item B<vpt_align=BOOLEAN>
 
 Specifies that periodic Virtual Platform Timers should be aligned to
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 52f1aa9..f9fb21e 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -313,7 +313,11 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("disable_migrate", libxl_defbool),
     ("cpuid",           libxl_cpuid_policy_list),
     ("blkdev_start",    string),
-    
+    ("numa_memszs",     Array(uint64, "nr_nodes")),
+    ("cpu_to_node",     Array(uint32, "nr_nodemap")),
+    ("distance",        Array(uint32, "nr_dist")),
+    ("vnode_to_pnode",  Array(uint32, "nr_node_to_pnode")),
+    ("vnuma_autoplacement",  libxl_defbool),
     ("device_model_version", libxl_device_model_version),
     ("device_model_stubdomain", libxl_defbool),
     # if you set device_model you must set device_model_version too
-- 
1.7.10.4


* [PATCH v5 4/8] vnuma topology parsing routines
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (3 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 3/8] vnuma xl.cfg.pod and idl config options Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 5/8] libxc: allocate domain vnuma nodes Elena Ufimtseva
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Parses the vnuma topology: number of nodes and memory
ranges. If not defined, initializes vnuma with
only one node and the default topology.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_vnuma.h |   11 ++
 tools/libxl/xl_cmdimpl.c  |  406 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 417 insertions(+)
 create mode 100644 tools/libxl/libxl_vnuma.h

diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h
new file mode 100644
index 0000000..f1568ae
--- /dev/null
+++ b/tools/libxl/libxl_vnuma.h
@@ -0,0 +1,11 @@
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#define VNUMA_NO_NODE ~((unsigned int)0)
+
+/*
+ * Minimum vNUMA node size in MB is taken as 64MB even though Linux currently
+ * allows 32MB, thus leaving some slack. Will be modified to match Linux.
+ */
+#define MIN_VNODE_SIZE  64U
+
+#define MAX_VNUMA_NODES ((unsigned int)1 << 10)
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 5195914..59855ed 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -40,6 +40,7 @@
 #include "libxl_json.h"
 #include "libxlutil.h"
 #include "xl.h"
+#include "libxl_vnuma.h"
 
 /* For calls which return an errno on failure */
 #define CHK_ERRNOVAL( call ) ({                                         \
@@ -725,6 +726,403 @@ static void parse_top_level_sdl_options(XLU_Config *config,
     xlu_cfg_replace_string (config, "xauthority", &sdl->xauthority, 0);
 }
 
+
+static unsigned int get_list_item_uint(XLU_ConfigList *list, unsigned int i)
+{
+    const char *buf;
+    char *ep;
+    unsigned long ul;
+    int rc = -EINVAL;
+    buf = xlu_cfg_get_listitem(list, i);
+    if (!buf)
+        return rc;
+    ul = strtoul(buf, &ep, 10);
+    if (ep == buf)
+        return rc;
+    if (ul >= UINT16_MAX)
+        return rc;
+    return (unsigned int)ul;
+}
+
+static void vdistance_set(unsigned int *vdistance,
+                                unsigned int nr_vnodes,
+                                unsigned int samenode,
+                                unsigned int othernode)
+{
+    unsigned int idx, slot;
+    for (idx = 0; idx < nr_vnodes; idx++)
+        for (slot = 0; slot < nr_vnodes; slot++)
+            *(vdistance + slot * nr_vnodes + idx) =
+                idx == slot ? samenode : othernode;
+}
+
+static void vcputovnode_default(unsigned int *cpu_to_node,
+                                unsigned int nr_vnodes,
+                                unsigned int max_vcpus)
+{
+    unsigned int cpu;
+    for (cpu = 0; cpu < max_vcpus; cpu++)
+        cpu_to_node[cpu] = cpu % nr_vnodes;
+}
+
+/* Split domain memory between vNUMA nodes equally */
+static int split_vnumamem(libxl_domain_build_info *b_info)
+{
+    unsigned long long vnodemem = 0;
+    unsigned long n;
+    unsigned int i;
+
+    /* In MBytes */
+    if (b_info->nr_nodes == 0)
+        return -1;
+    vnodemem = (b_info->max_memkb >> 10) / b_info->nr_nodes;
+    if (vnodemem < MIN_VNODE_SIZE)
+        return -1;
+    /* remainder in MBytes */
+    n = (b_info->max_memkb >> 10) % b_info->nr_nodes;
+    /* get final sizes in MBytes */
+    for (i = 0; i < (b_info->nr_nodes - 1); i++)
+        b_info->numa_memszs[i] = vnodemem;
+    /* add the remainder to the last node */
+    b_info->numa_memszs[i] = vnodemem + n;
+    return 0;
+}
+
+static void vnode_to_pnode_default(unsigned int *vnode_to_pnode,
+                                   unsigned int nr_vnodes)
+{
+    unsigned int i;
+    for (i = 0; i < nr_vnodes; i++)
+        vnode_to_pnode[i] = VNUMA_NO_NODE;
+}
+
+/*
+ * init vNUMA to "zero config" with one node and all other
+ * topology parameters set to default.
+ */
+static int vnuma_zero_config(libxl_domain_build_info *b_info)
+{
+    b_info->nr_nodes = 1;
+    /* all memory goes to this one vnode */
+    if (!(b_info->numa_memszs = (uint64_t *)calloc(b_info->nr_nodes,
+                                sizeof(*b_info->numa_memszs))))
+        goto bad_vnumazerocfg;
+
+    if (!(b_info->cpu_to_node = (unsigned int *)calloc(b_info->max_vcpus,
+                                sizeof(*b_info->cpu_to_node))))
+        goto bad_vnumazerocfg;
+
+    if (!(b_info->distance = (unsigned int *)calloc(b_info->nr_nodes *
+                                b_info->nr_nodes, sizeof(*b_info->distance))))
+        goto bad_vnumazerocfg;
+
+    if (!(b_info->vnode_to_pnode = (unsigned int *)calloc(b_info->nr_nodes,
+                                sizeof(*b_info->vnode_to_pnode))))
+        goto bad_vnumazerocfg;
+
+    b_info->numa_memszs[0] = b_info->max_memkb >> 10;
+
+    /* all vcpus assigned to this vnode */
+    vcputovnode_default(b_info->cpu_to_node, b_info->nr_nodes,
+                        b_info->max_vcpus);
+
+    /* default vdistance is 10 */
+    vdistance_set(b_info->distance, b_info->nr_nodes, 10, 10);
+
+    /* VNUMA_NO_NODE for vnode_to_pnode */
+    vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_nodes);
+
+    /*
+     * will be placed to some physical nodes defined by automatic
+     * numa placement or VNUMA_NO_NODE will not request exact node
+     */
+    libxl_defbool_set(&b_info->vnuma_autoplacement, true);
+    return 0;
+
+ bad_vnumazerocfg:
+    return -1;
+}
+
+/* Caller must exit */
+static void free_vnuma_info(libxl_domain_build_info *b_info)
+{
+    free(b_info->numa_memszs);
+    free(b_info->distance);
+    free(b_info->cpu_to_node);
+    free(b_info->vnode_to_pnode);
+    b_info->nr_nodes = 0;
+}
+
+/*
+static int vdistance_parse(char *vdistcfg, unsigned int *vdistance,
+                            unsigned int nr_vnodes)
+{
+    char *endptr, *toka, *tokb, *saveptra = NULL, *saveptrb = NULL;
+    unsigned int *vdist_tmp = NULL;
+    int rc = 0;
+    unsigned int i, j, parsed = 0;
+    unsigned long dist;
+
+    rc = -EINVAL;
+    if (vdistance == NULL) {
+        return rc;
+    }
+    vdist_tmp = (unsigned int *)malloc(nr_vnodes * nr_vnodes * sizeof(*vdistance));
+    if (vdist_tmp == NULL)
+        return rc;
+
+    i = j = 0;
+    for (toka = strtok_r(vdistcfg, ",", &saveptra); toka;
+        toka = strtok_r(NULL, ",", &saveptra)) {
+        if ( i >= nr_vnodes )
+            goto vdist_parse_err;
+        for (tokb = strtok_r(toka, " ", &saveptrb); tokb;
+            tokb = strtok_r(NULL, " ", &saveptrb)) {
+            if (j >= nr_vnodes)
+                goto vdist_parse_err;
+            dist = strtol(tokb, &endptr, 10);
+            if (dist > UINT16_MAX || dist < 0)
+                goto vdist_parse_err;
+            if (tokb == endptr)
+                goto vdist_parse_err;
+            *(vdist_tmp + j*nr_vnodes + i) = dist;
+            parsed++;
+            j++;
+        }
+        i++;
+        j = 0;
+    }
+    rc = parsed;
+    memcpy(vdistance, vdist_tmp, nr_vnodes * nr_vnodes * sizeof(*vdistance));
+
+ vdist_parse_err:
+    free(vdist_tmp);
+    return rc;
+}
+*/
+
+static void parse_vnuma_config(XLU_Config *config, libxl_domain_build_info *b_info)
+{
+    XLU_ConfigList *vnumamemcfg;
+    XLU_ConfigList *vdistancecfg, *vnodemap, *vcpumap;
+    int nr_vnuma_regions;
+    int nr_vdist, nr_vnodemap, nr_vcpumap, i;
+    unsigned long long vnuma_memparsed = 0;
+    long l;
+    unsigned long ul;
+    const char *buf;
+
+    if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) {
+        if (l > MAX_VNUMA_NODES) {
+            fprintf(stderr, "Too many vnuma nodes, max %d is allowed.\n", MAX_VNUMA_NODES);
+            goto bad_vnuma_config;
+        }
+        b_info->nr_nodes = l;
+
+        xlu_cfg_get_defbool(config, "vnuma_autoplacement", &b_info->vnuma_autoplacement, 0);
+
+        /* Only construct nodes with at least one vcpu for now */
+        if (b_info->nr_nodes != 0 && b_info->max_vcpus >= b_info->nr_nodes) {
+            if (!xlu_cfg_get_list(config, "vnumamem",
+                                  &vnumamemcfg, &nr_vnuma_regions, 0)) {
+
+                if (nr_vnuma_regions != b_info->nr_nodes) {
+                    fprintf(stderr, "Number of numa regions (vnumamem = %d) is incorrect (should be %d).\n",
+                            nr_vnuma_regions, b_info->nr_nodes);
+                    goto bad_vnuma_config;
+                }
+
+                b_info->numa_memszs = calloc(b_info->nr_nodes,
+                                              sizeof(*b_info->numa_memszs));
+                if (b_info->numa_memszs == NULL) {
+                    fprintf(stderr, "Unable to allocate memory for vnuma ranges.\n");
+                    goto bad_vnuma_config;
+                }
+
+                char *ep;
+                /*
+                 * Will parse only nr_vnodes times, even if we have more/less regions.
+                 * Take care of it later if less or discard if too many regions.
+                 */
+                for (i = 0; i < b_info->nr_nodes; i++) {
+                    buf = xlu_cfg_get_listitem(vnumamemcfg, i);
+                    if (!buf) {
+                        fprintf(stderr,
+                                "xl: Unable to get element %d in vnuma memory list.\n", i);
+                        break;
+                    }
+                    ul = strtoul(buf, &ep, 10);
+                    if (ep == buf) {
+                        fprintf(stderr,
+                                "xl: Invalid argument parsing vnumamem: %s.\n", buf);
+                        break;
+                    }
+
+                    /* 32Mb is a min size for a node, taken from Linux */
+                    if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) {
+                        fprintf(stderr, "xl: vnuma memory %lu is not within %u - %u range.\n",
+                                ul, MIN_VNODE_SIZE, UINT32_MAX);
+                        break;
+                    }
+
+                    /* memory in MBytes */
+                    b_info->numa_memszs[i] = ul;
+                }
+
+                /* Total memory for vNUMA parsed to verify */
+                for (i = 0; i < nr_vnuma_regions; i++)
+                    vnuma_memparsed = vnuma_memparsed + (b_info->numa_memszs[i]);
+
+                /* Amount of memory for vnodes same as total? */
+                if ((vnuma_memparsed << 10) != (b_info->max_memkb)) {
+                    fprintf(stderr, "xl: vnuma memory is not the same as domain memory size.\n");
+                    goto bad_vnuma_config;
+                }
+            } else {
+                b_info->numa_memszs = calloc(b_info->nr_nodes,
+                                              sizeof(*b_info->numa_memszs));
+                if (b_info->numa_memszs == NULL) {
+                    fprintf(stderr, "Unable to allocate memory for vnuma ranges.\n");
+                    goto bad_vnuma_config;
+                }
+
+                fprintf(stderr, "WARNING: vNUMA memory ranges were not specified.\n");
+                fprintf(stderr, "Using default equal vnode memory size %lu Kbytes to cover %lu Kbytes.\n",
+                                b_info->max_memkb / b_info->nr_nodes, b_info->max_memkb);
+
+                if (split_vnumamem(b_info) < 0) {
+                    fprintf(stderr, "Could not split vnuma memory into equal chunks.\n");
+                    goto bad_vnuma_config;
+                }
+            }
+
+            b_info->distance = calloc(b_info->nr_nodes * b_info->nr_nodes,
+                                       sizeof(*b_info->distance));
+            if (b_info->distance == NULL)
+                goto bad_vnuma_config;
+
+            if (!xlu_cfg_get_list(config, "vdistance", &vdistancecfg, &nr_vdist, 0)) {
+                int d1, d2;
+                /*
+                 * The first value is the same-node distance, the second is used
+                 * for all other distances. This is required for now to avoid a
+                 * non-symmetrical distance table, as that may break recent kernels.
+                 * TODO: Better way to analyze extended distance table, possibly
+                 * OS specific.
+                 */
+                d1 = get_list_item_uint(vdistancecfg, 0);
+                d2 = get_list_item_uint(vdistancecfg, 1);
+
+                if (d1 >= 0 && d2 >= 0 && d1 < d2) {
+                    vdistance_set(b_info->distance, b_info->nr_nodes, d1, d2);
+                } else {
+                    fprintf(stderr, "WARNING: vnuma distance values are incorrect.\n");
+                    goto bad_vnuma_config;
+                }
+
+            } else {
+                fprintf(stderr, "Could not parse vnuma distances.\n");
+                vdistance_set(b_info->distance, b_info->nr_nodes, 10, 20);
+            }
+
+            b_info->cpu_to_node = (unsigned int *)calloc(b_info->max_vcpus,
+                                     sizeof(*b_info->cpu_to_node));
+            if (b_info->cpu_to_node == NULL)
+                goto bad_vnuma_config;
+
+            if (!xlu_cfg_get_list(config, "numa_cpumask",
+                                  &vcpumap, &nr_vcpumap, 0)) {
+                if (nr_vcpumap == b_info->max_vcpus) {
+                    unsigned int  vnode, vcpumask = 0, vmask;
+                    vmask = ~(~0 << nr_vcpumap);
+                    for (i = 0; i < nr_vcpumap; i++) {
+                        vnode = get_list_item_uint(vcpumap, i);
+                        if (vnode >= 0 && vnode < b_info->nr_nodes) {
+                            vcpumask  |= (1 << i);
+                            b_info->cpu_to_node[i] = vnode;
+                        }
+                    }
+
+                    /* Did it cover all vcpus in the mask? */
+                    if ( !(((vmask & vcpumask) + 1) == (1 << nr_vcpumap)) ) {
+                        fprintf(stderr, "WARNING: Not all vnodes were covered in numa_cpumask.\n");
+                        goto bad_vnuma_config;
+                    }
+                } else {
+                    fprintf(stderr, "WARNING:  Bad vnuma_vcpumap.\n");
+                    goto bad_vnuma_config;
+                }
+            }
+            else
+                vcputovnode_default(b_info->cpu_to_node,
+                                    b_info->nr_nodes,
+                                    b_info->max_vcpus);
+
+            /* Is there a mapping to physical NUMA nodes? */
+            b_info->vnode_to_pnode = (unsigned int *)calloc(b_info->nr_nodes,
+                                            sizeof(*b_info->vnode_to_pnode));
+            if (b_info->vnode_to_pnode == NULL)
+                goto bad_vnuma_config;
+            if (!xlu_cfg_get_list(config, "vnuma_vnodemap",&vnodemap,
+                                                    &nr_vnodemap, 0)) {
+                /*
+                 * If not specified or incorrect, it will be defined
+                 * later based on the machine architecture, configuration
+                 * and memory available when creating the domain.
+                 */
+                if (nr_vnodemap == b_info->nr_nodes) {
+                    unsigned int vnodemask = 0, pnode, smask;
+                    smask = ~(~0 << b_info->nr_nodes);
+                    for (i = 0; i < b_info->nr_nodes; i++) {
+                        pnode = get_list_item_uint(vnodemap, i);
+                        if (pnode >= 0) {
+                            vnodemask |= (1 << i);
+                            b_info->vnode_to_pnode[i] = pnode;
+                        }
+                    }
+
+                    /* Did it cover all vnodes in the mask? */
+                    if ( !(((vnodemask & smask) + 1) == (1 << nr_vnodemap)) ) {
+                        fprintf(stderr, "WARNING: Not all vnodes were covered vnuma_vnodemap.\n");
+
+                        if (libxl_defbool_val(b_info->vnuma_autoplacement)) {
+                            fprintf(stderr, "Automatic placement will be used for vnodes.\n");
+                            vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_nodes);
+                        } else
+                            goto bad_vnuma_config;
+                    }
+                } else {
+                    fprintf(stderr, "WARNING: Incorrect vnuma_vnodemap.\n");
+
+                    if (libxl_defbool_val(b_info->vnuma_autoplacement)) {
+                        fprintf(stderr, "Automatic placement will be used for vnodes.\n");
+                        vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_nodes);
+                    } else
+                        goto bad_vnuma_config;
+                }
+            } else {
+                fprintf(stderr, "WARNING: Missing vnuma_vnodemap.\n");
+
+                if (libxl_defbool_val(b_info->vnuma_autoplacement)) {
+                    fprintf(stderr, "Automatic placement will be used for vnodes.\n");
+                    vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_nodes);
+                } else
+                    goto bad_vnuma_config;
+            }
+        }
+        else if (vnuma_zero_config(b_info))
+            goto bad_vnuma_config;
+    }
+    /* If vnuma topology is not defined for domain, init one node */
+    else if (vnuma_zero_config(b_info))
+            goto bad_vnuma_config;
+    return;
+
+ bad_vnuma_config:
+    free_vnuma_info(b_info);
+    exit(1);
+}
+
 static void parse_config_data(const char *config_source,
                               const char *config_data,
                               int config_len,
@@ -1081,6 +1479,14 @@ static void parse_config_data(const char *config_source,
             exit(1);
         }
 
+        libxl_defbool_set(&b_info->vnuma_autoplacement, false);
+
+        /*
+         * If there is no vnuma in the config, a "zero" vnuma config
+         * will be initialized with one node and other defaults.
+         */
+        parse_vnuma_config(config, b_info);
+
         xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0);
         switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args",
                                       &b_info->u.pv.bootloader_args, 1))
-- 
1.7.10.4

* [PATCH v5 5/8] libxc: allocate domain vnuma nodes
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (4 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 4/8] vnuma topology parsing routines Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 6/8] libxl: build e820 map for vnodes Elena Ufimtseva
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

vNUMA-aware domain memory allocation based on the built
vnode-to-pnode mask.
Every PV domain has at least one vNUMA node, and the
vnode-to-pnode mapping is taken into account during allocation.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxc/xc_dom.h     |   10 +++++++
 tools/libxc/xc_dom_x86.c |   69 ++++++++++++++++++++++++++++++++++++++--------
 tools/libxc/xg_private.h |    1 +
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
index c9af0ce..e628a0e 100644
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -122,6 +122,15 @@ struct xc_dom_image {
     struct xc_dom_phys *phys_pages;
     int realmodearea_log;
 
+    /*
+     * vNUMA topology and memory allocation structure.
+     * Defines the way to allocate memory on per NUMA
+     * physical nodes that is defined by vnode_to_pnode.
+     */
+    uint32_t nr_nodes;
+    uint64_t *numa_memszs;
+    unsigned int *vnode_to_pnode;
+
     /* malloc memory pool */
     struct xc_dom_mem *memblocks;
 
@@ -377,6 +386,7 @@ static inline xen_pfn_t xc_dom_p2m_guest(struct xc_dom_image *dom,
 int arch_setup_meminit(struct xc_dom_image *dom);
 int arch_setup_bootearly(struct xc_dom_image *dom);
 int arch_setup_bootlate(struct xc_dom_image *dom);
+int arch_boot_numa_alloc(struct xc_dom_image *dom);
 
 /*
  * Local variables:
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index e034d62..1992dfd 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -759,7 +759,7 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
 int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
-    xen_pfn_t pfn, allocsz, i, j, mfn;
+    xen_pfn_t pfn, i, j, mfn;
 
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -802,34 +802,79 @@ int arch_setup_meminit(struct xc_dom_image *dom)
     else
     {
         /* try to claim pages for early warning of insufficient memory avail */
+        rc = 0;
         if ( dom->claim_enabled ) {
             rc = xc_domain_claim_pages(dom->xch, dom->guest_domid,
                                        dom->total_pages);
             if ( rc )
+            {
+                xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                             "%s: Failed to claim mem for dom\n",
+                             __FUNCTION__);
                 return rc;
+            }
         }
         /* setup initial p2m */
         for ( pfn = 0; pfn < dom->total_pages; pfn++ )
             dom->p2m_host[pfn] = pfn;
         
         /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
-        {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
-        }
+        rc = arch_boot_numa_alloc(dom);
+        if ( rc )
+            return rc;
 
         /* Ensure no unclaimed pages are left unused.
          * OK to call if hadn't done the earlier claim call. */
         (void)xc_domain_claim_pages(dom->xch, dom->guest_domid,
                                     0 /* cancels the claim */);
     }
+    return rc;
+}
+
+/*
+ * Any pv guest will have at least one vnuma node
+ * with vnuma_memszs[0] = domain memory and the rest
+ * topology initialized with default values.
+ */
+int arch_boot_numa_alloc(struct xc_dom_image *dom)
+{
+    int rc;
+    unsigned int n, memflags;
+    unsigned long long vnode_pages;
+    unsigned long long allocsz = 0, node_pfn_base, i;
+
+    rc = allocsz = node_pfn_base = 0;
+
+    allocsz = 0;
+    for ( n = 0; n < dom->nr_nodes; n++ )
+    {
+        memflags = 0;
+        if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE )
+        {
+            memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]);
+            memflags |= XENMEMF_exact_node_request;
+        }
+        vnode_pages = (dom->numa_memszs[n] << 20) >> PAGE_SHIFT_X86;
+        for ( i = 0;
+              (i < vnode_pages) && !rc;
+              i += allocsz )
+        {
+            allocsz = vnode_pages - i;
+            if ( allocsz > 1024*1024 )
+                allocsz = 1024*1024;
+            rc = xc_domain_populate_physmap_exact(
+                dom->xch, dom->guest_domid, allocsz,
+                0, memflags, &dom->p2m_host[node_pfn_base + i]);
+        }
+        if ( rc )
+        {
+            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                    "%s: Failed allocation of %Lu pages for vnode %d on pnode %d out of %lu\n",
+                    __FUNCTION__, vnode_pages, n, dom->vnode_to_pnode[n], dom->total_pages);
+            return rc;
+        }
+        node_pfn_base += i;
+    }
 
     return rc;
 }
diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h
index f5755fd..15ee876 100644
--- a/tools/libxc/xg_private.h
+++ b/tools/libxc/xg_private.h
@@ -123,6 +123,7 @@ typedef uint64_t l4_pgentry_64_t;
 #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
 #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT)
 
+#define VNUMA_NO_NODE ~((unsigned int)0)
 
 /* XXX SMH: following skanky macros rely on variable p2m_size being set */
 /* XXX TJD: also, "guest_width" should be the guest's sizeof(unsigned long) */
-- 
1.7.10.4

* [PATCH v5 6/8] libxl: build e820 map for vnodes
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (5 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 5/8] libxc: allocate domain vnuma nodes Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 7/8] libxl: place vnuma domain nodes on numa nodes Elena Ufimtseva
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Build the e820 map from the vNUMA memory ranges.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_internal.h |   10 ++++
 tools/libxl/libxl_numa.c     |  125 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 082749e..7ae8508 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3113,6 +3113,16 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
  */
 #define CTYPE(isfoo,c) (isfoo((unsigned char)(c)))
 
+int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
+                         uint32_t *nr_entries,
+                         unsigned long map_limitkb,
+                         unsigned long balloon_kb);
+
+int libxl__vnuma_align_mem(libxl__gc *gc,
+                            uint32_t domid,
+                            struct libxl_domain_build_info *b_info,
+                            vmemrange_t *memblks);
+
 
 #endif
 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
index 94ca4fe..38f1546 100644
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -19,6 +19,8 @@
 
 #include "libxl_internal.h"
 
+#include "libxl_vnuma.h"
+
 /*
  * What follows are helpers for generating all the k-combinations
  * without repetitions of a set S with n elements in it. Formally
@@ -508,6 +510,129 @@ int libxl__get_numa_candidate(libxl__gc *gc,
 }
 
 /*
+/*
+ * Used for a PV guest with e820_host enabled and thus
+ * having non-contiguous e820 memory map.
+ */
+static unsigned long e820_memory_hole_size(unsigned long start,
+                                            unsigned long end,
+                                            struct e820entry e820[],
+                                            unsigned int nr)
+{
+    unsigned int i;
+    unsigned long absent, start_pfn, end_pfn;
+
+    absent = end - start;
+    for (i = 0; i < nr; i++) {
+        /* if not an E820_RAM region, skip it and don't subtract from absent */
+        if (e820[i].type == E820_RAM) {
+            start_pfn = e820[i].addr;
+            end_pfn =   e820[i].addr + e820[i].size;
+            /* beginning pfn is in this region? */
+            if (start >= start_pfn && start <= end_pfn) {
+                if (end > end_pfn)
+                    absent -= end_pfn - start;
+                else
+                    /* fit the region? then no absent pages */
+                    absent -= end - start;
+                continue;
+            }
+            /* found the end of range in this region? */
+            if (end <= end_pfn && end >= start_pfn) {
+                absent -= end - start_pfn;
+                /* no need to look for more ranges */
+                break;
+            }
+        }
+    }
+    return absent;
+}
+
+/*
+ * Checks for the beginning and end of RAM in the e820 map for the domain
+ * and aligns the start of the first and the end of the last vNUMA memory
+ * block to that map. vnode memory sizes are passed here in Megabytes.
+ * For a PV guest the e820 map has fixed hole sizes.
+ */
+int libxl__vnuma_align_mem(libxl__gc *gc,
+                            uint32_t domid,
+                            libxl_domain_build_info *b_info, /* IN: mem sizes */
+                            vmemrange_t *memblks)        /* OUT: linux numa blocks in pfn */
+{
+    unsigned int i;
+    int j, rc;
+    uint64_t next_start_pfn, end_max = 0, size;
+    uint32_t nr;
+    struct e820entry map[E820MAX];
+
+    if (b_info->nr_nodes == 0)
+        return -EINVAL;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+
+    /* retrieve e820 map for this host */
+    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
+
+    if (rc < 0) {
+        errno = rc;
+        return -EINVAL;
+    }
+    nr = rc;
+    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    if (rc)
+    {
+        errno = rc;
+        return -EINVAL;
+    }
+
+    /* max pfn for this host */
+    for (j = nr - 1; j >= 0; j--)
+        if (map[j].type == E820_RAM) {
+            end_max = map[j].addr + map[j].size;
+            break;
+        }
+
+    memset(memblks, 0, sizeof(*memblks) * b_info->nr_nodes);
+    next_start_pfn = 0;
+
+    memblks[0].start = map[0].addr;
+
+    for (i = 0; i < b_info->nr_nodes; i++) {
+        /* start may be non-zero */
+        memblks[i].start += next_start_pfn;
+        memblks[i].end = memblks[i].start + (b_info->numa_memszs[i] << 20);
+
+        size = memblks[i].end - memblks[i].start;
+        /*
+         * For a PV guest with the e820_host option turned on we need
+         * to take memory holes into account. For a PV guest with
+         * e820_host disabled or unset, the map is a contiguous
+         * RAM region.
+         */
+        if (libxl_defbool_val(b_info->u.pv.e820_host)) {
+            while((memblks[i].end - memblks[i].start -
+                   e820_memory_hole_size(memblks[i].start,
+                   memblks[i].end, map, nr)) < size )
+            {
+                memblks[i].end += MIN_VNODE_SIZE << 10;
+                if (memblks[i].end > end_max) {
+                    memblks[i].end = end_max;
+                    break;
+                }
+            }
+        }
+        next_start_pfn = memblks[i].end;
+        LIBXL__LOG(ctx, LIBXL__LOG_DEBUG,"i %d, start  = %#010lx, end = %#010lx\n",
+                    i, memblks[i].start, memblks[i].end);
+    }
+    if (memblks[i-1].end > end_max)
+        memblks[i-1].end = end_max;
+
+    return 0;
+}
+
+/*
  * Local variables:
  * mode: C
  * c-basic-offset: 4
-- 
1.7.10.4

* [PATCH v5 7/8] libxl: place vnuma domain nodes on numa nodes
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (6 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 6/8] libxl: build e820 map for vnodes Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info out on debug-key Elena Ufimtseva
  2014-06-03 11:37 ` [PATCH v5 0/8] vnuma introduction Wei Liu
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Automatic NUMA placement overrides the manual vnode placement
mechanism. If vnode-to-pnode placement is explicitly specified and
automatic placement is not in use, try to fit the vnodes onto those
physical nodes.
This can be changed if needed, but this variant seems the least confusing.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl.c          |   22 ++++++++
 tools/libxl/libxl.h          |   12 ++++
 tools/libxl/libxl_arch.h     |    4 ++
 tools/libxl/libxl_dom.c      |  126 +++++++++++++++++++++++++++++++++++++++++-
 tools/libxl/libxl_internal.h |    3 +
 tools/libxl/libxl_numa.c     |   44 +++++++++++++++
 tools/libxl/libxl_x86.c      |    3 +-
 7 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 900b8d4..4034a63 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4701,6 +4701,28 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid,
     return 0;
 }
 
+int libxl_domain_setvnuma(libxl_ctx *ctx,
+                            uint32_t domid,
+                            uint16_t nr_vnodes,
+                            uint16_t nr_vcpus,
+                            vmemrange_t *vmemrange,
+                            unsigned int *vdistance,
+                            unsigned int *vcpu_to_vnode,
+                            unsigned int *vnode_to_pnode)
+{
+    int ret;
+    ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes,
+                                nr_vcpus, vmemrange,
+                                vdistance,
+                                vcpu_to_vnode,
+                                vnode_to_pnode);
+    if (ret < 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx");
+        return ERROR_FAIL;
+    }
+    return ret;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 80947c3..f7082ff 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -308,11 +308,14 @@
 #include <netinet/in.h>
 #include <sys/wait.h> /* for pid_t */
 
+#include <xen/memory.h>
 #include <xentoollog.h>
 
 #include <libxl_uuid.h>
 #include <_libxl_list.h>
 
+#include <xen/vnuma.h>
+
 /* API compatibility. */
 #ifdef LIBXL_API_VERSION
 #if LIBXL_API_VERSION != 0x040200 && LIBXL_API_VERSION != 0x040300 && \
@@ -856,6 +859,15 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus);
 void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms);
 void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms);
 
+int libxl_domain_setvnuma(libxl_ctx *ctx,
+                           uint32_t domid,
+                           uint16_t nr_vnodes,
+                           uint16_t nr_vcpus,
+                           vmemrange_t *vmemrange,
+                           unsigned int *vdistance,
+                           unsigned int *vcpu_to_vnode,
+                           unsigned int *vnode_to_pnode);
+
 /*
  * Devices
  * =======
diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
index d3bc136..004ec18 100644
--- a/tools/libxl/libxl_arch.h
+++ b/tools/libxl/libxl_arch.h
@@ -27,4 +27,8 @@ int libxl__arch_domain_init_hw_description(libxl__gc *gc,
 int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
                                       libxl_domain_build_info *info,
                                       struct xc_dom_image *dom);
+int libxl__arch_domain_configure(libxl__gc *gc,
+                                 libxl_domain_build_info *info,
+                                 struct xc_dom_image *dom);
+
 #endif
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 661999c..bf922ab 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -23,6 +23,7 @@
 #include <xc_dom.h>
 #include <xen/hvm/hvm_info_table.h>
 #include <xen/hvm/hvm_xs_strings.h>
+#include <libxl_vnuma.h>
 
 libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid)
 {
@@ -227,6 +228,60 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
                     libxl_defbool_val(info->u.hvm.nested_hvm));
 }
 
+/* sets vnode_to_pnode map */
+static int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
+                        libxl_domain_build_info *info)
+{
+    unsigned int i, n;
+    int nr_nodes = 0;
+    uint64_t *vnodes_mem;
+    unsigned long long *nodes_claim = NULL;
+    libxl_numainfo *ninfo = NULL;
+
+    if (info->vnode_to_pnode == NULL) {
+        info->vnode_to_pnode = libxl__calloc(gc, info->nr_nodes,
+                                      sizeof(*info->vnode_to_pnode));
+    }
+
+    /* default setting */
+    for (i = 0; i < info->nr_nodes; i++)
+        info->vnode_to_pnode[i] = VNUMA_NO_NODE;
+
+    /* Get NUMA info */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return ERROR_FAIL;
+    /* Nothing to see if only one NUMA node */
+    if (nr_nodes <= 1)
+        return 0;
+
+    vnodes_mem = info->numa_memszs;
+    /*
+     * TODO: change the algorithm. The current one just fits the nodes.
+     * It would be nice to have them also sorted by size.
+     * If no pnode can be found, the vnode is left set to VNUMA_NO_NODE.
+     */
+    nodes_claim = libxl__calloc(gc, info->nr_nodes, sizeof(*nodes_claim));
+    if ( !nodes_claim )
+        return ERROR_FAIL;
+
+    libxl_for_each_set_bit(n, info->nodemap)
+    {
+        for (i = 0; i < info->nr_nodes; i++)
+        {
+            if (((nodes_claim[n] + (vnodes_mem[i] << 20)) <= ninfo[n].free) &&
+                 /*vnode was not set yet */
+                 (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
+            {
+                info->vnode_to_pnode[i] = n;
+                nodes_claim[n] += (vnodes_mem[i] << 20);
+            }
+        }
+    }
+
+    return 0;
+}
+
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
               libxl_domain_config *d_config, libxl__domain_build_state *state)
 {
@@ -240,6 +295,22 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         return ERROR_FAIL;
     }
 
+    /* The memory blocks will be formed here from sizes */
+    struct vmemrange *memrange = libxl__calloc(gc, info->nr_nodes,
+                                            sizeof(*memrange));
+
+    if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) {
+        LOG(DETAIL, "Failed to align memory map.\n");
+        return ERROR_FAIL;
+    }
+
+    /* numa_placement and vnuma_autoplacement handling:
+     * If numa_placement is left at its default, do not use the vnode to
+     * pnode mapping, as the automatic placement algorithm will find the
+     * best NUMA nodes. If numa_placement is disabled, we can try to use
+     * the domain's vnode to pnode mask.
+     */
+    if (libxl_defbool_val(info->numa_placement)) {
     /*
      * Check if the domain has any CPU affinity. If not, try to build
      * up one. In case numa_place_domain() find at least a suitable
@@ -249,7 +320,6 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
      * libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
      */
-    if (libxl_defbool_val(info->numa_placement)) {
         if (!libxl_bitmap_is_full(&info->cpumap)) {
             LOG(ERROR, "Can run NUMA placement only if no vcpu "
                        "affinity is specified");
@@ -259,7 +329,40 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
         rc = numa_place_domain(gc, domid, info);
         if (rc)
             return rc;
+
+        /* If a vnode_to_pnode mask was defined, don't use it when we
+         * automatically place the domain on NUMA nodes; just warn.
+         */
+        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
+            LOG(INFO, "Automatic NUMA placement for domain is turned on. \
+                vnode to physical nodes mapping will not be used.");
+        }
+        if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
+            LOG(ERROR, "Failed to build vnode to pnode map\n");
+            return ERROR_FAIL;
+        }
+    } else {
+        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
+            if (!libxl__vnodemap_is_usable(gc, info)) {
+                LOG(ERROR, "Defined vnode to pnode domain map cannot be used.\n");
+                return ERROR_FAIL;
+            }
+        } else {
+            if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
+                LOG(ERROR, "Failed to build vnode to pnode map.\n");
+                return ERROR_FAIL;
+            }
+        }
+    }
+
+    if (xc_domain_setvnuma(ctx->xch, domid, info->nr_nodes,
+                            info->max_vcpus, memrange,
+                            info->distance, info->cpu_to_node,
+                            info->vnode_to_pnode) < 0) {
+       LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Failed to set vnuma topology for domain from\n.");
+       return ERROR_FAIL;
     }
+
     libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
 
@@ -422,6 +525,26 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     dom->xenstore_domid = state->store_domid;
     dom->claim_enabled = libxl_defbool_val(info->claim_mode);
 
+    dom->vnode_to_pnode = (unsigned int *)malloc(
+                            info->nr_nodes * sizeof(*info->vnode_to_pnode));
+    dom->numa_memszs = (uint64_t *)malloc(
+                          info->nr_nodes * sizeof(*info->numa_memszs));
+
+    if ( dom->numa_memszs == NULL || dom->vnode_to_pnode == NULL ) {
+        info->nr_nodes = 0;
+        if (dom->vnode_to_pnode) free(dom->vnode_to_pnode);
+        if (dom->numa_memszs) free(dom->numa_memszs);
+        LOGE(ERROR, "Failed to allocate memory for vnuma");
+        goto out;
+    }
+
+    memcpy(dom->numa_memszs, info->numa_memszs,
+            sizeof(*info->numa_memszs) * info->nr_nodes);
+    memcpy(dom->vnode_to_pnode, info->vnode_to_pnode,
+            sizeof(*info->vnode_to_pnode) * info->nr_nodes);
+
+    dom->nr_nodes = info->nr_nodes;
+
     if ( (ret = xc_dom_boot_xen_init(dom, ctx->xch, domid)) != 0 ) {
         LOGE(ERROR, "xc_dom_boot_xen_init failed");
         goto out;
@@ -432,6 +555,7 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
         goto out;
     }
 #endif
+
     if ( (ret = xc_dom_parse_image(dom)) != 0 ) {
         LOGE(ERROR, "xc_dom_parse_image failed");
         goto out;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 7ae8508..9a71fd9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3113,6 +3113,9 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
  */
 #define CTYPE(isfoo,c) (isfoo((unsigned char)(c)))
 
+unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
+                                libxl_domain_build_info *info);
+
 int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
                          uint32_t *nr_entries,
                          unsigned long map_limitkb,
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
index 38f1546..7814a63 100644
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -510,6 +510,50 @@ int libxl__get_numa_candidate(libxl__gc *gc,
 }
 
 /*
+ * Check if we can fit vnuma nodes to numa pnodes
+ * from vnode_to_pnode mask.
+ */
+unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info)
+{
+    unsigned int i;
+    libxl_numainfo *ninfo = NULL;
+    unsigned long long *claim;
+    unsigned int node;
+    uint64_t *mems;
+    int rc, nr_nodes;
+
+    rc = nr_nodes = 0;
+
+    /*
+     * Cannot use specified mapping if not NUMA machine
+     */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return rc;
+
+    mems = info->numa_memszs;
+    claim = libxl__calloc(gc, info->nr_nodes, sizeof(*claim));
+    /* Get total memory required on each physical node */
+    for (i = 0; i < info->nr_nodes; i++)
+    {
+        node = info->vnode_to_pnode[i];
+        /* Correct pnode number? */
+        if (node < nr_nodes)
+            claim[node] += (mems[i] << 20);
+        else
+            goto vnodemapout;
+   }
+   for (i = 0; i < nr_nodes; i++) {
+       if (claim[i] > ninfo[i].free)
+          /* Cannot complete user request, falling back to default */
+          goto vnodemapout;
+   }
+   rc = 1;
+
+ vnodemapout:
+   return rc;
+}
+
 /*
  * Used for a PV guest with e820_host enabled and thus
  * having non-contiguous e820 memory map.
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 7589060..46e84e4 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -1,5 +1,6 @@
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_vnuma.h"
 
 static const char *e820_names(int type)
 {
@@ -14,7 +15,7 @@ static const char *e820_names(int type)
     return "Unknown";
 }
 
-static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
+int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
                          uint32_t *nr_entries,
                          unsigned long map_limitkb,
                          unsigned long balloon_kb)
-- 
1.7.10.4

* [PATCH v5 8/8] add vnuma info out on debug-key
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (7 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 7/8] libxl: place vnuma domain nodes on numa nodes Elena Ufimtseva
@ 2014-06-03  4:53 ` Elena Ufimtseva
  2014-06-03 11:37 ` [PATCH v5 0/8] vnuma introduction Wei Liu
  9 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-03  4:53 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, JBeulich,
	Elena Ufimtseva

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/arch/x86/numa.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index b141877..8377492 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -347,9 +347,10 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
 	s_time_t now = NOW();
-	int i;
+	int i, j, n, err;
 	struct domain *d;
 	struct page_info *page;
+	char tmp[12];
 	unsigned int page_num_node[MAX_NUMNODES];
 
 	printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
@@ -389,6 +390,32 @@ static void dump_numa(unsigned char key)
 
 		for_each_online_node(i)
 			printk("    Node %u: %u\n", i, page_num_node[i]);
+        printk("    Domain has %u vnodes\n", d->vnuma.nr_vnodes);
+
+		for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) {
+			err = snprintf(tmp, 12, "%u", d->vnuma.vnode_to_pnode[i]);
+			if ( err < 0 )
+				printk("        vnode %u - pnode %s,", i, "any");
+			else
+				printk("        vnode %u - pnode %s,", i,
+			d->vnuma.vnode_to_pnode[i] == NUMA_NO_NODE ? "any" : tmp);
+			printk(" %"PRIu64" MB, ",
+				(d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20);
+			printk("vcpus: ");
+
+			for ( j = 0, n = 0; j < d->max_vcpus; j++ ) {
+				if ( d->vnuma.vcpu_to_vnode[j] == i ) {
+					if ( !((n + 1) % 8) )
+						printk("%u\n", j);
+					else if ( !(n % 8) && n != 0 )
+							printk("%s%u ", "             ", j);
+						else
+							printk("%u ", j);
+					n++;
+				}
+			}
+			printk("\n");
+		}
 	}
 
 	rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4

* Re: [PATCH v5 1/8] xen: vnuma topoplogy and subop hypercalls
  2014-06-03  4:53 ` [PATCH v5 1/8] xen: vnuma topoplogy and subop hypercalls Elena Ufimtseva
@ 2014-06-03  8:55   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2014-06-03  8:55 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	dario.faggioli, stefano.stabellini, ian.jackson, xen-devel

>>> On 03.06.14 at 06:53, <ufimtseva@gmail.com> wrote:
> +int vnuma_init_zero_topology(struct domain *d)
> +{
> +    d->vnuma.vmemrange[0].end = d->vnuma.vmemrange[d->vnuma.nr_vnodes - 1].end;
> +    d->vnuma.vdistance[0] = 10;
> +    memset(d->vnuma.vnode_to_pnode, NUMA_NO_NODE, d->vnuma.nr_vnodes);

Here you're relying on specific characteristics of NUMA_NO_NODE; you
shouldn't be using memset() in cases like this.
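
Something along these lines (only a sketch) would avoid depending on the
value's byte pattern and on the element size:

    unsigned int i;

    /* Assign each element explicitly rather than filling bytes. */
    for ( i = 0; i < d->vnuma.nr_vnodes; i++ )
        d->vnuma.vnode_to_pnode[i] = NUMA_NO_NODE;

(Note also that the memset() as written covers only nr_vnodes bytes, not
nr_vnodes unsigned int elements.)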

> +    memset(d->vnuma.vcpu_to_vnode, 0, d->max_vcpus);
> +    d->vnuma.nr_vnodes = 1;

Also, considering this, is there any point at all in setting any but the
first array element of d->vnuma.vnode_to_pnode[]?

> +    return 0;
> +}

And in the end the function has no caller anyway, so one can't even
judge whether e.g. d->vnuma.nr_vnodes is reliably > 0 on entry.

> @@ -888,6 +889,89 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>      }
>      break;
>  
> +    case XEN_DOMCTL_setvnumainfo:
> +    {
> +        unsigned int dist_size, nr_vnodes;
> +
> +        ret = -EINVAL;
> +
> +        /* If number of vnodes was set before, skip */
> +        if ( d->vnuma.nr_vnodes > 0 )
> +            break;
> +
> +        nr_vnodes = op->u.vnuma.nr_vnodes;
> +        if ( nr_vnodes == 0 )
> +            goto setvnumainfo_out;
> +
> +        if ( nr_vnodes > (UINT_MAX / nr_vnodes) )
> +            goto setvnumainfo_out;
> +
> +        ret = -EFAULT;
> +        if ( guest_handle_is_null(op->u.vnuma.vdistance)     ||
> +             guest_handle_is_null(op->u.vnuma.vmemrange)     ||
> +             guest_handle_is_null(op->u.vnuma.vcpu_to_vnode) ||
> +             guest_handle_is_null(op->u.vnuma.vnode_to_pnode) )
> +            goto setvnumainfo_out;
> +
> +        dist_size = nr_vnodes * nr_vnodes;
> +
> +        d->vnuma.vdistance = xmalloc_array(unsigned int, dist_size);

I think XSA-77 was issued between the previous iteration of this series
and this one. With that advisory in mind the question here is - how big
of an allocation can this become? I.e. I'm afraid the restriction enforced
above on nr_vnodes isn't enough. A reasonable thing to do might be
to make sure none of the allocation sizes would exceed PAGE_SIZE, at
least for now (any bigger allocation - if needed in the future - would
then need to be split).
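
I.e. something like this on top of the existing zero/overflow checks
(sketch only, the exact bound is of course up for discussion):

    /* Cap the largest table (the distance matrix) at one page so a
     * caller-controlled nr_vnodes can't force large runtime allocations. */
    if ( nr_vnodes > PAGE_SIZE / sizeof(*d->vnuma.vdistance) / nr_vnodes )
        goto setvnumainfo_out;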

> +        d->vnuma.vmemrange = xmalloc_array(vmemrange_t, nr_vnodes);
> +        d->vnuma.vcpu_to_vnode = xmalloc_array(unsigned int, d->max_vcpus);

While not an unbounded allocation, I assume you realize that the
current worst case is an order-3 page allocation here, which is
prone to fail after reasonably long uptime of a system. It may
nevertheless be okay for starters, since only domains with more
than 1k vCPU-s would be affected, and that seems tolerable for
the moment.
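
(For reference, if I have the limit right: 8192 PV vCPU-s times
sizeof(unsigned int) is 32k, i.e. 8 contiguous pages = order 3; anything
above 1024 vCPU-s already needs more than one page.)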

> +        d->vnuma.vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes);
> +
> +        if ( d->vnuma.vdistance == NULL ||
> +             d->vnuma.vmemrange == NULL ||
> +             d->vnuma.vcpu_to_vnode == NULL ||
> +             d->vnuma.vnode_to_pnode == NULL )
> +        {
> +            ret = -ENOMEM;
> +            goto setvnumainfo_nomem;

This is the only use of this label, i.e. the error handling code could be
moved right here. Also, if you use goto here, then please be
consistent about where you set "ret".

> +        }
> +
> +        if ( unlikely(__copy_from_guest(d->vnuma.vdistance,
> +                                    op->u.vnuma.vdistance,
> +                                    dist_size)) )
> +            goto setvnumainfo_out;
> +        if ( unlikely(__copy_from_guest(d->vnuma.vmemrange,
> +                                    op->u.vnuma.vmemrange,
> +                                    nr_vnodes)) )
> +            goto setvnumainfo_out;
> +        if ( unlikely(__copy_from_guest(d->vnuma.vcpu_to_vnode,
> +                                    op->u.vnuma.vcpu_to_vnode,
> +                                    d->max_vcpus)) )
> +            goto setvnumainfo_out;
> +        if ( unlikely(__copy_from_guest(d->vnuma.vnode_to_pnode,
> +                                    op->u.vnuma.vnode_to_pnode,
> +                                    nr_vnodes)) )
> +            goto setvnumainfo_out;

I'm relatively certain I commented on this earlier on: None of these
__copy_from_guest() uses are legitimate, you always need
copy_from_guest() here.
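
I.e. simply (sketch):

    if ( copy_from_guest(d->vnuma.vdistance, op->u.vnuma.vdistance,
                         dist_size) )
        goto setvnumainfo_out;

copy_from_guest() validates the guest range itself, while the __ variants
skip that check and are only safe after a prior successful check or copy
on the very same range.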

> +
> +        /* Everything is good, lets set the number of vnodes */
> +        d->vnuma.nr_vnodes = nr_vnodes;
> +
> +        ret = 0;
> +        break;
> +
> + setvnumainfo_out:
> +        /* On failure, set one vNUMA node */
> +        d->vnuma.vmemrange[0].end = d->vnuma.vmemrange[d->vnuma.nr_vnodes - 1].end;
> +        d->vnuma.vdistance[0] = 10;
> +        memset(d->vnuma.vnode_to_pnode, NUMA_NO_NODE, d->vnuma.nr_vnodes);
> +        memset(d->vnuma.vcpu_to_vnode, 0, d->max_vcpus);
> +        d->vnuma.nr_vnodes = 1;

Isn't this an open coded instance of vnuma_init_zero_topology()?
Furthermore, earlier code rejects d->vnuma.nr_vnodes > 0, i.e. the
array access above is underflowing the array. Plus
d->vnuma.vmemrange[] is in an unknown state, so if you want to
restore previous state you need to do this differently.

> +        ret = 0;

And this is not a success path - according to the goto-s above you
mean -EFAULT here.

> +        break;
> +
> + setvnumainfo_nomem:
> +        /* The only case where we set number of vnodes to 0 */
> +        d->vnuma.nr_vnodes = 0;
> +        xfree(d->vnuma.vmemrange);
> +        xfree(d->vnuma.vdistance);
> +        xfree(d->vnuma.vnode_to_pnode);
> +        xfree(d->vnuma.vcpu_to_vnode);

And this is open coded vnuma_destroy().

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -963,6 +963,73 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>          break;
>  
> +    case XENMEM_get_vnuma_info:
> +    {
> +        struct vnuma_topology_info guest_topo;
> +        struct domain *d;
> +
> +        if ( copy_from_guest(&guest_topo, arg, 1) )
> +            return -EFAULT;
> +        if ( (d = rcu_lock_domain_by_any_id(guest_topo.domid)) == NULL )
> +            return -ESRCH;
> +
> +        if ( d->vnuma.nr_vnodes == 0 ) {

Coding style still.

> +            rc = -EOPNOTSUPP;
> +            goto vnumainfo_out;
> +        }
> +
> +        rc = -EOPNOTSUPP;
> +        /*
> +         * Guest may have different kernel configuration for
> +         * number of cpus/nodes. It informs about them via hypercall.
> +         */
> +        if ( guest_topo.nr_vnodes < d->vnuma.nr_vnodes ||
> +            guest_topo.nr_vcpus < d->max_vcpus )
> +            goto vnumainfo_out;

So the "no vNUMA" and "insufficient space" cases are completely
indistinguishable for the guest if you use the same error value for
them. Perhaps the latter should be -ENOBUFS?
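
I.e. perhaps (sketch):

    if ( d->vnuma.nr_vnodes == 0 )
    {
        rc = -EOPNOTSUPP;       /* no vNUMA topology set for this domain */
        goto vnumainfo_out;
    }
    if ( guest_topo.nr_vnodes < d->vnuma.nr_vnodes ||
         guest_topo.nr_vcpus < d->max_vcpus )
    {
        rc = -ENOBUFS;          /* caller's buffers are too small */
        goto vnumainfo_out;
    }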

> +
> +        rc = -EFAULT;
> +
> +        if ( guest_handle_is_null(guest_topo.vmemrange.h)    ||
> +             guest_handle_is_null(guest_topo.vdistance.h)    ||
> +             guest_handle_is_null(guest_topo.vcpu_to_vnode.h) )
> +            goto vnumainfo_out;
> +
> +        /*
> +         * Take a failure path if out of guest allocated memory for topology.
> +         * No partial copying.
> +         */
> +        guest_topo.nr_vnodes = d->vnuma.nr_vnodes;
> +
> +        if ( __copy_to_guest(guest_topo.vmemrange.h,
> +                                d->vnuma.vmemrange,
> +                                d->vnuma.nr_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( __copy_to_guest(guest_topo.vdistance.h,
> +                                d->vnuma.vdistance,
> +                                d->vnuma.nr_vnodes * d->vnuma.nr_vnodes) != 0 )
> +            goto vnumainfo_out;
> +
> +        if ( __copy_to_guest(guest_topo.vcpu_to_vnode.h,
> +                                d->vnuma.vcpu_to_vnode,
> +                                d->max_vcpus) != 0 )
> +            goto vnumainfo_out;

Again you mean copy_to_guest() everywhere here.

Furthermore, how will the caller know the number of entries filled
with the last of them? You only tell it about the actual number of
virtual nodes above... Or actually, you only _mean_ to tell it about
that number - I don't see you copying that field back out (that
would be a place where the __ version of a guest-copy function
would be correctly used, due to the earlier copy_from_guest() on
the same address range).
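
I.e. after filling the arrays, something along these lines (sketch,
assuming the usual handle typedef exists for struct vnuma_topology_info):

    /* Tell the caller how many vnode entries were actually filled in. */
    if ( __copy_field_to_guest(guest_handle_cast(arg, vnuma_topology_info_t),
                               &guest_topo, nr_vnodes) )
        goto vnumainfo_out;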

And finally, how is the consistency of data here ensured against
a racing XEN_DOMCTL_setvnumainfo?

> +
> +        rc = 0;
> +
> + vnumainfo_out:
> +        if ( rc != 0 )
> +            /*
> +             * In case of failure to set vNUMA topology for guest,
> +             * leave everything as it is, print error only. Tools will
> +             * show for domain vnuma topology, but wont be seen in guest.
> +             */
> +            gdprintk(XENLOG_INFO, "vNUMA: failed to copy topology info to guest.\n");

No debugging printk()-s like this please in non-RFC patches. And the
comment is talking about "set" too, while we're in "get" here.

> +struct xen_domctl_vnuma {
> +    uint32_t nr_vnodes;
> +    uint32_t __pad;

Please avoid double underscores or anything else violating C name
space rules in public headers (eventual existing badness not being an
excuse).

> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -354,6 +354,20 @@ struct xen_pod_target {
>  };
>  typedef struct xen_pod_target xen_pod_target_t;
>  
> +/*
> + * XENMEM_get_vnuma_info used by caller to get
> + * vNUMA topology constructed for particular domain.
> + *
> + * The data exchanged is presented by vnuma_topology_info.
> + */
> +#define XENMEM_get_vnuma_info               26
> +
> +/*
> + * XENMEM_get_vnuma_pnode used by guest to determine
> + * the physical node of the specified vnode.
> + */
> +/*#define XENMEM_get_vnuma_pnode              27*/

What's this? And if it was enabled, what would its use be? I thought
that we agreed that the guest shouldn't know of physical NUMA
information.

> --- /dev/null
> +++ b/xen/include/public/vnuma.h

And I know I commented on this one - the information being exposed
through a mem-op, the respective definitions should be put in memory.h.

> +struct vnuma_topology_info {
> +    /* IN */
> +    domid_t domid;
> +    /* IN/OUT */
> +    unsigned int nr_vnodes;
> +    unsigned int nr_vcpus;

Interestingly here you even document nr_vcpus as also being an
output of the hypercall.

> +    /* OUT */
> +    union {
> +        XEN_GUEST_HANDLE(uint) h;
> +        uint64_t    _pad;

Same remark as above regarding C name space violation.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -444,6 +444,7 @@ struct domain
>      nodemask_t node_affinity;
>      unsigned int last_alloc_node;
>      spinlock_t node_affinity_lock;
> +    struct vnuma_info vnuma;

This being put nicely in a sub-structure of course raises the question
whether, in order to restrict struct domain growth, this wouldn't better
be a pointer.
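
I.e. (sketch):

    struct vnuma_info *vnuma;   /* xzalloc()-ed on first XEN_DOMCTL_setvnumainfo */

with the consumers then checking d->vnuma for NULL instead of looking at
nr_vnodes == 0.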

Jan

* Re: [PATCH v5 8/8] add vnuma info for debug-key
  2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info for debug-key Elena Ufimtseva
@ 2014-06-03  9:04   ` Jan Beulich
  2014-06-04  4:13     ` Elena Ufimtseva
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2014-06-03  9:04 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, lccycc123, george.dunlap, msw,
	dario.faggioli, stefano.stabellini, ian.jackson, xen-devel

>>> On 03.06.14 at 06:53, <ufimtseva@gmail.com> wrote:

This patch appears to be duplicated (with slightly different titles) in
the series - please fix this up for the next revision.

> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -347,9 +347,10 @@ EXPORT_SYMBOL(node_data);
>  static void dump_numa(unsigned char key)
>  {
>  	s_time_t now = NOW();
> -	int i;
> +	int i, j, n, err;

"unsigned int" please for any variables used as array indexes or alike
and not explicitly needing to be signed for some reason.

>  	struct domain *d;
>  	struct page_info *page;
> +	char tmp[12];

Are you perhaps unaware of keyhandler_scratch[]?
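
I.e. roughly (sketch, not build tested):

	snprintf(keyhandler_scratch, sizeof(keyhandler_scratch), "%u",
	         d->vnuma.vnode_to_pnode[i]);

which also saves you from having to size a new on-stack buffer correctly.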

>  	unsigned int page_num_node[MAX_NUMNODES];
>  
>  	printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
> @@ -389,6 +390,32 @@ static void dump_numa(unsigned char key)
>  
>  		for_each_online_node(i)
>  			printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +		printk("    Domain has %u vnodes\n", d->vnuma.nr_vnodes);
> +		for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) {
> +			err = snprintf(tmp, 12, "%u", d->vnuma.vnode_to_pnode[i]);
> +			if ( err < 0 )
> +				printk("        vnode %u - pnode %s,", i, "any");
> +			else
> +				printk("        vnode %u - pnode %s,", i,
> +			d->vnuma.vnode_to_pnode[i] == NUMA_NO_NODE ? "any" : tmp);

Broken indentation.

> +			printk(" %"PRIu64" MB, ",
> +				(d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20);

Here too.

> +			printk("vcpus: ");
> +
> +			for ( j = 0, n = 0; j < d->max_vcpus; j++ ) {
> +				if ( d->vnuma.vcpu_to_vnode[j] == i ) {
> +					if ( !((n + 1) % 8) )
> +						printk("%u\n", j);
> +					else if ( !(n % 8) && n != 0 )
> +							printk("%s%u ", "             ", j);
> +						else
> +							printk("%u ", j);

And yet more of them. Also the sequence of leading spaces would better
be expressed with %8u or some such, but from the looks of it properly
aligning the output here would require all of this to be done a little
differently anyway.
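
E.g. something along these lines (sketch):

	if ( d->vnuma.vcpu_to_vnode[j] == i )
	{
		if ( n && !(n % 8) )
			printk("\n%13s", "");
		printk("%3u ", j);
		n++;
	}

so the continuation indent comes from the format string rather than from a
hard-coded run of blanks.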

Jan

* Re: [PATCH v5 0/8] vnuma introduction
  2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
                   ` (8 preceding siblings ...)
  2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info out on debug-key Elena Ufimtseva
@ 2014-06-03 11:37 ` Wei Liu
  2014-06-04  4:05   ` Elena Ufimtseva
  9 siblings, 1 reply; 15+ messages in thread
From: Wei Liu @ 2014-06-03 11:37 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
	dario.faggioli, lccycc123, ian.jackson, xen-devel, JBeulich,
	wei.liu2

I think this series differs from your commits in xenvnuma_v5.git tree.
Probably there are some stale patches?

I cannot see the patch to xl to parse the new config options, and
there's a duplicate patch for debug key. Probably it's better to resend
the whole series with correct diffstat and commits IMHO...

Wei.

* Re: [PATCH v5 0/8] vnuma introduction
  2014-06-03 11:37 ` [PATCH v5 0/8] vnuma introduction Wei Liu
@ 2014-06-04  4:05   ` Elena Ufimtseva
  0 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-04  4:05 UTC (permalink / raw)
  To: Wei Liu
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
	Matt Wilson, Dario Faggioli, Li Yechen, Ian Jackson, xen-devel,
	Jan Beulich


On Tue, Jun 3, 2014 at 7:37 AM, Wei Liu <wei.liu2@citrix.com> wrote:

> I think this series differs from your commits in xenvnuma_v5.git tree.
> Probably there are some stale patches?
>
> I cannot see the patch to xl to parse the new config options, and
> there's a duplicate patch for debug key. Probably it's better to resend
> the whole series with corret diffstat and commits IMHO...
>
>
Yes, you are correct, they are different. I will resend the patches and
update the git tree.


> Wei.
>



-- 
Elena

* Re: [PATCH v5 8/8] add vnuma info for debug-key
  2014-06-03  9:04   ` Jan Beulich
@ 2014-06-04  4:13     ` Elena Ufimtseva
  0 siblings, 0 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2014-06-04  4:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Li Yechen, George Dunlap, Matt Wilson,
	Dario Faggioli, Stefano Stabellini, Ian Jackson, xen-devel


On Tue, Jun 3, 2014 at 5:04 AM, Jan Beulich <JBeulich@suse.com> wrote:

> >>> On 03.06.14 at 06:53, <ufimtseva@gmail.com> wrote:
>
> This patch appears to be duplicated (with slightly different titles) in
> the series - please fix this up for the next revision.
>

Jan, I will be resending these patches, thank you.

>
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > @@ -347,9 +347,10 @@ EXPORT_SYMBOL(node_data);
> >  static void dump_numa(unsigned char key)
> >  {
> >       s_time_t now = NOW();
> > -     int i;
> > +     int i, j, n, err;
>
> "unsigned int" please for any variables used as array indexes or alike
> and not explicitly needing to be signed for some reason.
>
> >       struct domain *d;
> >       struct page_info *page;
> > +     char tmp[12];
>
> Are you perhaps unaware of keyhandler_scratch[]?
>
> >       unsigned int page_num_node[MAX_NUMNODES];
> >
> >       printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
> > @@ -389,6 +390,32 @@ static void dump_numa(unsigned char key)
> >
> >               for_each_online_node(i)
> >                       printk("    Node %u: %u\n", i, page_num_node[i]);
> > +
> > +             printk("    Domain has %u vnodes\n", d->vnuma.nr_vnodes);
> > +             for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) {
> > +                     err = snprintf(tmp, 12, "%u",
> d->vnuma.vnode_to_pnode[i]);
> > +                     if ( err < 0 )
> > +                             printk("        vnode %u - pnode %s,", i,
> "any");
> > +                     else
> > +                             printk("        vnode %u - pnode %s,", i,
> > +                     d->vnuma.vnode_to_pnode[i] == NUMA_NO_NODE ? "any"
> : tmp);
>
> Broken indentation.
>
> > +                     printk(" %"PRIu64" MB, ",
> > +                             (d->vnuma.vmemrange[i].end -
> d->vnuma.vmemrange[i].start) >> 20);
>
> Here too.
>

The formatting in numa.c is utterly confusing. It seems I can never get it
right.
I have tried to follow the file's indentation, but there must be more to it
if I still can't get it right.
I have verified the formatting with the vi command :set list and it looked
good to me.
I also used :set noet ci pi sts=0 sw=4 ts=4 here.


> > +                     printk("vcpus: ");
> > +
> > +                     for ( j = 0, n = 0; j < d->max_vcpus; j++ ) {
> > +                             if ( d->vnuma.vcpu_to_vnode[j] == i ) {
> > +                                     if ( !((n + 1) % 8) )
> > +                                             printk("%u\n", j);
> > +                                     else if ( !(n % 8) && n != 0 )
> > +                                                     printk("%s%u ", "
>             ", j);
> > +                                             else
> > +                                                     printk("%u ", j);
>
> And yet more of them. Also the sequence of leading spaces would better
> be expressed with %8u or some such, but from the looks of it properly
> aligning the output here would require all of this to be done a little
> differently anyway.
>

Ok, let me see what I can do here.



> Jan
>
>


-- 
Elena

end of thread

Thread overview: 15+ messages
2014-06-03  4:53 [PATCH v5 0/8] vnuma introduction Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info for debug-key Elena Ufimtseva
2014-06-03  9:04   ` Jan Beulich
2014-06-04  4:13     ` Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 1/8] xen: vnuma topoplogy and subop hypercalls Elena Ufimtseva
2014-06-03  8:55   ` Jan Beulich
2014-06-03  4:53 ` [PATCH v5 2/8] libxc: Plumb Xen with vnuma topology Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 3/8] vnuma xl.cfg.pod and idl config options Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 4/8] vnuma topology parsing routines Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 5/8] libxc: allocate domain vnuma nodes Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 6/8] libxl: build e820 map for vnodes Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 7/8] libxl: place vnuma domain nodes on numa nodes Elena Ufimtseva
2014-06-03  4:53 ` [PATCH v5 8/8] add vnuma info out on debug-key Elena Ufimtseva
2014-06-03 11:37 ` [PATCH v5 0/8] vnuma introduction Wei Liu
2014-06-04  4:05   ` Elena Ufimtseva
