* [PATCH 0/4] add support for vNVDIMM
@ 2015-12-29 11:31 Haozhong Zhang
  2015-12-29 11:31 ` [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb Haozhong Zhang
                   ` (5 more replies)
  0 siblings, 6 replies; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-29 11:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	Jan Beulich, Wei Liu

This patch series is the Xen part of the work to provide virtual NVDIMM
devices to guests. The corresponding QEMU patch series is sent separately
under the title "[PATCH 0/2] add vNVDIMM support for Xen".

* Background

 NVDIMM (Non-Volatile Dual In-line Memory Module) devices are going to
 be supported on Intel platforms. NVDIMM devices are discovered via ACPI
 and configured through the _DSM methods of the NVDIMM devices in ACPI.
 Relevant documents:
 [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
 [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
 [3] DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
 [4] Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
	       
 The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support for
 providing virtual NVDIMM in PMEM mode, in which NVDIMM devices are
 mapped into the CPU's address space and are accessed via normal memory
 reads/writes plus three special instructions (clflushopt/clwb/pcommit).

 This patch series and the corresponding QEMU patch series enable Xen
 to provide vNVDIMM devices to HVM domains.

* Design

 Supporting vNVDIMM in PMEM mode has three requirements.

 (1) Support the special instructions that operate on cache lines
     (clflushopt & clwb) and persistent memory (pcommit).

     clflushopt and clwb take a linear address as their operand, and we
     allow them to be executed directly (i.e. without emulation) in HVM
     domains. This is done by Xen patch 1.

     pcommit is also allowed to be executed directly by the L1 guest,
     and we let the L1 hypervisor handle pcommit executed in an L2
     guest. This is done by Xen patch 2.
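
     For reference, the flush/commit sequence a guest pmem driver would
     issue looks roughly like the sketch below. This is illustrative
     only: it assumes a 64-byte cache line, GCC-style inline asm, and
     encodes pcommit as raw opcode bytes in case the assembler does not
     know the mnemonic.

         #include <stddef.h>
         #include <stdint.h>

         #define CACHE_LINE 64

         static inline void clwb(void *p)
         {
             asm volatile("clwb %0" : "+m" (*(volatile char *)p));
         }

         static inline void pcommit(void)
         {
             /* PCOMMIT == 66 0F AE F8 */
             asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory");
         }

         /* Write back the buffer's cache lines and commit them to the
          * NVDIMM's power-fail safe domain. */
         static void persist(void *buf, size_t len)
         {
             uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
             uintptr_t end = (uintptr_t)buf + len;

             for ( ; p < end; p += CACHE_LINE )
                 clwb((void *)p);
             asm volatile("sfence" ::: "memory"); /* order the flushes   */
             pcommit();
             asm volatile("sfence" ::: "memory"); /* wait for the commit */
         }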

 (2) When an NVDIMM works in PMEM mode, it must be mapped into the
     CPU's address space.

     When an HVM domain is created, if it does not use any guest
     address space above 4 GB, the vNVDIMM is mapped into the guest
     address space starting at 4 GB. Otherwise, if the highest guest
     address used above 4 GB is X, the vNVDIMM is mapped into the guest
     address space above X. This is done by QEMU patch 1.
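
     The placement rule boils down to something like the following
     (illustrative sketch only; the function and variable names are
     made up and are not taken from the QEMU patches):

         #include <stdint.h>

         /* max_used_gpa: exclusive end of the highest guest address
          * already in use above 4 GB, or 0 if none. */
         static uint64_t vnvdimm_base(uint64_t max_used_gpa)
         {
             const uint64_t four_gb = 4ULL << 30;

             return (max_used_gpa <= four_gb) ? four_gb : max_used_gpa;
         }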

 (3) NVDIMM devices are discovered and configured through ACPI. A major
     and complicated part of the vNVDIMM implementation in upstream
     QEMU is building those ACPI tables. To avoid reimplementing
     similar code in hvmloader, we decided to reuse the ACPI tables
     built by QEMU.

     We patch QEMU to build the NFIT and other vNVDIMM ACPI tables when
     it is used as Xen's device model, and to copy them to the end of
     guest memory below 4 GB. The guest address and size of those ACPI
     tables are saved to xenstore so that hvmloader can find them. This
     is done by QEMU patch 2.

     We also patch hvmloader to load those extra ACPI tables. We reuse
     and extend the existing hvmloader mechanism for loading
     passthrough ACPI tables for this purpose. This is done by Xen
     patch 4.
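
     The xenstore handoff in item (3) amounts to the device model
     writing the address/length keys that hvmloader later reads. A
     rough sketch of the QEMU side, assuming the dm-acpi keys live
     under the guest's domain path like the existing acpi/address and
     acpi/length keys (error handling omitted):

         #include <inttypes.h>
         #include <stdint.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <xenstore.h>

         static void advertise_dm_acpi(unsigned int domid,
                                       uint32_t gpa, uint32_t len)
         {
             struct xs_handle *xsh = xs_open(0);
             char *dompath = xs_get_domain_path(xsh, domid);
             char path[256], val[32];

             snprintf(path, sizeof(path),
                      "%s/hvmloader/dm-acpi/address", dompath);
             snprintf(val, sizeof(val), "0x%"PRIx32, gpa);
             xs_write(xsh, XBT_NULL, path, val, strlen(val));

             snprintf(path, sizeof(path),
                      "%s/hvmloader/dm-acpi/length", dompath);
             snprintf(val, sizeof(val), "0x%"PRIx32, len);
             xs_write(xsh, XBT_NULL, path, val, strlen(val));

             free(dompath);
             xs_close(xsh);
         }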

 In addition, Xen patch 3 adds an xl configuration option 'nvdimm' and
 passes the parsed parameters to QEMU to create the vNVDIMM devices.

* Test
 (1) A patched upstream QEMU is used for the test. The QEMU patch
     series is sent separately with the title "[PATCH 0/2] add vNVDIMM
     support for Xen". (vNVDIMM support is not present in qemu-xen as
     of commit f165e58, so we use upstream QEMU instead.)

 (2) Prepare a memory backend file:
            dd if=/dev/zero of=/tmp/nvm0 bs=1G count=10

 (3) Add the following line to an HVM domain's xl configuration file:
            nvdimm = [ 'file=/tmp/nvm0,size=10240' ]

 (4) Launch an HVM domain from the above xl.cfg.

 (5) If the guest Linux kernel is 4.2 or newer and the kernel modules
     libnvdimm, nfit, nd_btt and nd_pmem are loaded, the whole NVDIMM
     device is exposed as a single namespace and /dev/pmem0 appears in
     the guest.



Haozhong Zhang (4):
  x86/hvm: allow guest to use clflushopt and clwb
  x86/hvm: add support for pcommit instruction
  tools/xl: add a new xl configuration 'nvdimm'
  hvmloader: add support to load extra ACPI tables from qemu

 docs/man/xl.cfg.pod.5                   | 19 ++++++++++++++
 tools/firmware/hvmloader/acpi/build.c   | 34 ++++++++++++++++++++-----
 tools/libxc/xc_cpufeature.h             |  4 ++-
 tools/libxc/xc_cpuid_x86.c              |  5 +++-
 tools/libxl/libxl_dm.c                  | 15 +++++++++--
 tools/libxl/libxl_types.idl             |  9 +++++++
 tools/libxl/xl_cmdimpl.c                | 45 +++++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c                  | 10 ++++++++
 xen/arch/x86/hvm/vmx/vmcs.c             |  6 ++++-
 xen/arch/x86/hvm/vmx/vmx.c              |  1 +
 xen/arch/x86/hvm/vmx/vvmx.c             |  3 +++
 xen/include/asm-x86/cpufeature.h        |  7 +++++
 xen/include/asm-x86/hvm/vmx/vmcs.h      |  3 +++
 xen/include/asm-x86/hvm/vmx/vmx.h       |  1 +
 xen/include/public/hvm/hvm_xs_strings.h |  3 +++
 15 files changed, 154 insertions(+), 11 deletions(-)

-- 
2.4.8


* [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb
  2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
@ 2015-12-29 11:31 ` Haozhong Zhang
  2015-12-29 15:46   ` Andrew Cooper
  2015-12-29 11:31 ` [PATCH 2/4] x86/hvm: add support for pcommit instruction Haozhong Zhang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-29 11:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	Jan Beulich, Wei Liu

Pass the CPU features CLFLUSHOPT and CLWB into HVM domains so that these
two instructions can be used by guests.

The specification of the above two instructions can be found at
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 tools/libxc/xc_cpufeature.h      | 3 ++-
 tools/libxc/xc_cpuid_x86.c       | 4 +++-
 xen/arch/x86/hvm/hvm.c           | 7 +++++++
 xen/include/asm-x86/cpufeature.h | 5 +++++
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
index c3ddc80..5288ac6 100644
--- a/tools/libxc/xc_cpufeature.h
+++ b/tools/libxc/xc_cpufeature.h
@@ -140,6 +140,7 @@
 #define X86_FEATURE_RDSEED      18 /* RDSEED instruction */
 #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
 #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
-
+#define X86_FEATURE_CLFLUSHOPT  23 /* CLFLUSHOPT instruction */
+#define X86_FEATURE_CLWB        24 /* CLWB instruction */
 
 #endif /* __LIBXC_CPUFEATURE_H */
diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
index 8882c01..fecfd6c 100644
--- a/tools/libxc/xc_cpuid_x86.c
+++ b/tools/libxc/xc_cpuid_x86.c
@@ -426,7 +426,9 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
                         bitmaskof(X86_FEATURE_RDSEED)  |
                         bitmaskof(X86_FEATURE_ADX)  |
                         bitmaskof(X86_FEATURE_SMAP) |
-                        bitmaskof(X86_FEATURE_FSGSBASE));
+                        bitmaskof(X86_FEATURE_FSGSBASE) |
+                        bitmaskof(X86_FEATURE_CLWB) |
+                        bitmaskof(X86_FEATURE_CLFLUSHOPT));
         } else
             regs[1] = 0;
         regs[0] = regs[2] = regs[3] = 0;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 21470ec..58c83a5 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4598,6 +4598,13 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
         /* Don't expose INVPCID to non-hap hvm. */
         if ( (count == 0) && !hap_enabled(d) )
             *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);
+
+        if ( (count == 0) && !cpu_has_clflushopt )
+            *ebx &= ~cpufeat_mask(X86_FEATURE_CLFLUSHOPT);
+
+        if ( (count == 0) && !cpu_has_clwb )
+            *ebx &= ~cpufeat_mask(X86_FEATURE_CLWB);
+
         break;
     case 0xb:
         /* Fix the x2APIC identifier. */
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index ef96514..5818228 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -162,6 +162,8 @@
 #define X86_FEATURE_RDSEED	(7*32+18) /* RDSEED instruction */
 #define X86_FEATURE_ADX		(7*32+19) /* ADCX, ADOX instructions */
 #define X86_FEATURE_SMAP	(7*32+20) /* Supervisor Mode Access Prevention */
+#define X86_FEATURE_CLFLUSHOPT	(7*32+23) /* CLFLUSHOPT instruction */
+#define X86_FEATURE_CLWB	(7*32+24) /* CLWB instruction */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 8 */
 #define X86_FEATURE_PKU	(8*32+ 3) /* Protection Keys for Userspace */
@@ -234,6 +236,9 @@
 #define cpu_has_xgetbv1		boot_cpu_has(X86_FEATURE_XGETBV1)
 #define cpu_has_xsaves		boot_cpu_has(X86_FEATURE_XSAVES)
 
+#define cpu_has_clflushopt  boot_cpu_has(X86_FEATURE_CLFLUSHOPT)
+#define cpu_has_clwb        boot_cpu_has(X86_FEATURE_CLWB)
+
 enum _cache_type {
     CACHE_TYPE_NULL = 0,
     CACHE_TYPE_DATA = 1,
-- 
2.4.8


* [PATCH 2/4] x86/hvm: add support for pcommit instruction
  2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
  2015-12-29 11:31 ` [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb Haozhong Zhang
@ 2015-12-29 11:31 ` Haozhong Zhang
  2015-12-29 11:31 ` [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm' Haozhong Zhang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-29 11:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	Jan Beulich, Wei Liu

Pass the PCOMMIT CPU feature into HVM domains. Currently, we do not
intercept the pcommit instruction for L1 guests, and allow L1 to
intercept pcommit instructions executed in L2 guests.

The specification of the pcommit instruction can be found at
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 tools/libxc/xc_cpufeature.h        | 1 +
 tools/libxc/xc_cpuid_x86.c         | 1 +
 xen/arch/x86/hvm/hvm.c             | 3 +++
 xen/arch/x86/hvm/vmx/vmcs.c        | 6 +++++-
 xen/arch/x86/hvm/vmx/vmx.c         | 1 +
 xen/arch/x86/hvm/vmx/vvmx.c        | 3 +++
 xen/include/asm-x86/cpufeature.h   | 2 ++
 xen/include/asm-x86/hvm/vmx/vmcs.h | 3 +++
 xen/include/asm-x86/hvm/vmx/vmx.h  | 1 +
 9 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
index 5288ac6..ee53679 100644
--- a/tools/libxc/xc_cpufeature.h
+++ b/tools/libxc/xc_cpufeature.h
@@ -140,6 +140,7 @@
 #define X86_FEATURE_RDSEED      18 /* RDSEED instruction */
 #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
 #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
+#define X86_FEATURE_PCOMMIT     22 /* PCOMMIT instruction */
 #define X86_FEATURE_CLFLUSHOPT  23 /* CLFLUSHOPT instruction */
 #define X86_FEATURE_CLWB        24 /* CLWB instruction */
 
diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
index fecfd6c..c142595 100644
--- a/tools/libxc/xc_cpuid_x86.c
+++ b/tools/libxc/xc_cpuid_x86.c
@@ -427,6 +427,7 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
                         bitmaskof(X86_FEATURE_ADX)  |
                         bitmaskof(X86_FEATURE_SMAP) |
                         bitmaskof(X86_FEATURE_FSGSBASE) |
+                        bitmaskof(X86_FEATURE_PCOMMIT) |
                         bitmaskof(X86_FEATURE_CLWB) |
                         bitmaskof(X86_FEATURE_CLFLUSHOPT));
         } else
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 58c83a5..d12f619 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4605,6 +4605,9 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
         if ( (count == 0) && !cpu_has_clwb )
             *ebx &= ~cpufeat_mask(X86_FEATURE_CLWB);
 
+        if ( (count == 0) && !cpu_has_pcommit )
+            *ebx &= ~cpufeat_mask(X86_FEATURE_PCOMMIT);
+
         break;
     case 0xb:
         /* Fix the x2APIC identifier. */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index edd4c8d..9092a98 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -242,7 +242,8 @@ static int vmx_init_vmcs_config(void)
                SECONDARY_EXEC_ENABLE_INVPCID |
                SECONDARY_EXEC_ENABLE_VM_FUNCTIONS |
                SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS |
-               SECONDARY_EXEC_XSAVES);
+               SECONDARY_EXEC_XSAVES |
+               SECONDARY_EXEC_PCOMMIT);
         rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap);
         if ( _vmx_misc_cap & VMX_MISC_VMWRITE_ALL )
             opt |= SECONDARY_EXEC_ENABLE_VMCS_SHADOWING;
@@ -1075,6 +1076,9 @@ static int construct_vmcs(struct vcpu *v)
         __vmwrite(PLE_WINDOW, ple_window);
     }
 
+    if ( cpu_has_vmx_pcommit )
+        v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_PCOMMIT;
+
     if ( cpu_has_vmx_secondary_exec_control )
         __vmwrite(SECONDARY_VM_EXEC_CONTROL,
                   v->arch.hvm_vmx.secondary_exec_control);
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index b918b8a..0991cdf 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3517,6 +3517,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
     case EXIT_REASON_ACCESS_LDTR_OR_TR:
     case EXIT_REASON_VMX_PREEMPTION_TIMER_EXPIRED:
     case EXIT_REASON_INVPCID:
+    case EXIT_REASON_PCOMMIT:
     /* fall through */
     default:
     exit_and_crash:
diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
index ea1052e..271ec70 100644
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -1950,6 +1950,8 @@ int nvmx_msr_read_intercept(unsigned int msr, u64 *msr_content)
                SECONDARY_EXEC_ENABLE_VPID |
                SECONDARY_EXEC_UNRESTRICTED_GUEST |
                SECONDARY_EXEC_ENABLE_EPT;
+        if ( cpu_has_vmx_pcommit )
+            data |= SECONDARY_EXEC_PCOMMIT;
         data = gen_vmx_msr(data, 0, host_data);
         break;
     case MSR_IA32_VMX_EXIT_CTLS:
@@ -2226,6 +2228,7 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
     case EXIT_REASON_VMXON:
     case EXIT_REASON_INVEPT:
     case EXIT_REASON_XSETBV:
+    case EXIT_REASON_PCOMMIT:
         /* inject to L1 */
         nvcpu->nv_vmexit_pending = 1;
         break;
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 5818228..7491e37 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -162,6 +162,7 @@
 #define X86_FEATURE_RDSEED	(7*32+18) /* RDSEED instruction */
 #define X86_FEATURE_ADX		(7*32+19) /* ADCX, ADOX instructions */
 #define X86_FEATURE_SMAP	(7*32+20) /* Supervisor Mode Access Prevention */
+#define X86_FEATURE_PCOMMIT	(7*32+22) /* PCOMMIT instruction */
 #define X86_FEATURE_CLFLUSHOPT	(7*32+23) /* CLFLUSHOPT instruction */
 #define X86_FEATURE_CLWB	(7*32+24) /* CLWB instruction */
 
@@ -238,6 +239,7 @@
 
 #define cpu_has_clflushopt  boot_cpu_has(X86_FEATURE_CLFLUSHOPT)
 #define cpu_has_clwb        boot_cpu_has(X86_FEATURE_CLWB)
+#define cpu_has_pcommit     boot_cpu_has(X86_FEATURE_PCOMMIT)
 
 enum _cache_type {
     CACHE_TYPE_NULL = 0,
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index d1496b8..77cf8da 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -236,6 +236,7 @@ extern u32 vmx_vmentry_control;
 #define SECONDARY_EXEC_ENABLE_PML               0x00020000
 #define SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS   0x00040000
 #define SECONDARY_EXEC_XSAVES                   0x00100000
+#define SECONDARY_EXEC_PCOMMIT                  0x00200000
 extern u32 vmx_secondary_exec_control;
 
 #define VMX_EPT_EXEC_ONLY_SUPPORTED                         0x00000001
@@ -303,6 +304,8 @@ extern u64 vmx_ept_vpid_cap;
     (vmx_secondary_exec_control & SECONDARY_EXEC_ENABLE_PML)
 #define cpu_has_vmx_xsaves \
     (vmx_secondary_exec_control & SECONDARY_EXEC_XSAVES)
+#define cpu_has_vmx_pcommit \
+    (vmx_secondary_exec_control & SECONDARY_EXEC_PCOMMIT)
 
 #define VMCS_RID_TYPE_MASK              0x80000000
 
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 1719965..14f3d32 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -213,6 +213,7 @@ static inline void pi_clear_sn(struct pi_desc *pi_desc)
 #define EXIT_REASON_PML_FULL            62
 #define EXIT_REASON_XSAVES              63
 #define EXIT_REASON_XRSTORS             64
+#define EXIT_REASON_PCOMMIT             65
 
 /*
  * Interruption-information format
-- 
2.4.8


* [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm'
  2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
  2015-12-29 11:31 ` [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb Haozhong Zhang
  2015-12-29 11:31 ` [PATCH 2/4] x86/hvm: add support for pcommit instruction Haozhong Zhang
@ 2015-12-29 11:31 ` Haozhong Zhang
  2016-01-04 11:16   ` Wei Liu
  2016-01-06 12:40   ` Jan Beulich
  2015-12-29 11:31 ` [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu Haozhong Zhang
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-29 11:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	Jan Beulich, Wei Liu

This configuration option is used to specify vNVDIMM devices which are
provided to the guest. xl parses this configuration and passes the
result to QEMU, which is responsible for creating the vNVDIMM devices.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 docs/man/xl.cfg.pod.5       | 19 +++++++++++++++++++
 tools/libxl/libxl_dm.c      | 15 +++++++++++++--
 tools/libxl/libxl_types.idl |  9 +++++++++
 tools/libxl/xl_cmdimpl.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 8899f75..a10d28e 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,25 @@ FIFO-based event channel ABI support up to 131,071 event channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
 x86).
 
+=item B<nvdimm=[ "NVDIMM_SPEC_STRING", "NVDIMM_SPEC_STRING", ... ]>
+
+Specifies the NVDIMM devices which are provided to the guest.
+
+Each B<NVDIMM_SPEC_STRING> is a comma-separated list of C<KEY=VALUE>
+settings, from the following list:
+
+=over 4
+
+=item C<file=PATH_TO_NVDIMM_DEVICE_FILE>
+
+Specifies the path to the file of the NVDIMM device, e.g. file=/dev/pmem0.
+
+=item C<size=MBYTES>
+
+Specifies the size in Mbytes of the NVDIMM device.
+
+=back
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 0aaefd9..6fb4bbb 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -763,6 +763,7 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
     const libxl_device_nic *nics = guest_config->nics;
     const int num_disks = guest_config->num_disks;
     const int num_nics = guest_config->num_nics;
+    const int num_nvdimms = guest_config->num_nvdimms;
     const libxl_vnc_info *vnc = libxl__dm_vnc(guest_config);
     const libxl_sdl_info *sdl = dm_sdl(guest_config);
     const char *keymap = dm_keymap(guest_config);
@@ -1124,7 +1125,6 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
                                             machinearg, max_ram_below_4g);
             }
         }
-
         if (libxl_defbool_val(b_info->u.hvm.gfx_passthru)) {
             enum libxl_gfx_passthru_kind gfx_passthru_kind =
                             libxl__detect_gfx_passthru_kind(gc, guest_config);
@@ -1140,7 +1140,8 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
                 return ERROR_INVAL;
             }
         }
-
+        if (num_nvdimms)
+            machinearg = libxl__sprintf(gc, "%s,nvdimm", machinearg);
         flexarray_append(dm_args, machinearg);
         for (i = 0; b_info->extra_hvm && b_info->extra_hvm[i] != NULL; i++)
             flexarray_append(dm_args, b_info->extra_hvm[i]);
@@ -1154,6 +1155,16 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
     flexarray_append(dm_args, GCSPRINTF("%"PRId64, ram_size));
 
     if (b_info->type == LIBXL_DOMAIN_TYPE_HVM) {
+        for (i = 0; i < num_nvdimms; i++) {
+            flexarray_append(dm_args, "-device");
+            flexarray_append(dm_args,
+                             libxl__sprintf(gc, "pc-nvdimm,file=%s,size=%"PRIu64,
+                                            guest_config->nvdimms[i].file,
+                                            guest_config->nvdimms[i].size_mb));
+        }
+    }
+
+    if (b_info->type == LIBXL_DOMAIN_TYPE_HVM) {
         if (b_info->u.hvm.hdtype == LIBXL_HDTYPE_AHCI)
             flexarray_append_pair(dm_args, "-device", "ahci,id=ahci0");
         for (i = 0; i < num_disks; i++) {
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9658356..0a955a1 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -617,6 +617,14 @@ libxl_device_vtpm = Struct("device_vtpm", [
     ("uuid",             libxl_uuid),
 ])
 
+libxl_device_nvdimm = Struct("device_nvdimm", [
+    ("backend_domid",    libxl_domid),
+    ("backend_domname",  string),
+    ("devid",            libxl_devid),
+    ("file",             string),
+    ("size_mb",          uint64),
+])
+
 libxl_device_channel = Struct("device_channel", [
     ("backend_domid", libxl_domid),
     ("backend_domname", string),
@@ -641,6 +649,7 @@ libxl_domain_config = Struct("domain_config", [
     ("vfbs", Array(libxl_device_vfb, "num_vfbs")),
     ("vkbs", Array(libxl_device_vkb, "num_vkbs")),
     ("vtpms", Array(libxl_device_vtpm, "num_vtpms")),
+    ("nvdimms", Array(libxl_device_nvdimm, "num_nvdimms")),
     # a channel manifests as a console with a name,
     # see docs/misc/channels.txt
     ("channels", Array(libxl_device_channel, "num_channels")),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index f9933cb..2db7d45 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1255,6 +1255,49 @@ static void parse_vnuma_config(const XLU_Config *config,
     free(vcpu_parsed);
 }
 
+/*
+ * NVDIMM config is in the format:
+ *   nvdimm = [ 'file=path-to-pmem-dev,size=size-of-file-in-MByte',
+ *              'file=path-to-pmem-dev,size=size-of-file-in-MByte',
+ *              ... ]
+ */
+static void parse_nvdimm_config(XLU_Config *config,
+                                libxl_domain_config *d_config)
+{
+    XLU_ConfigList *nvdimms;
+    const char *buf;
+
+    if (!xlu_cfg_get_list (config, "nvdimm", &nvdimms, 0, 0)) {
+        while ((buf = xlu_cfg_get_listitem(nvdimms,
+                                           d_config->num_nvdimms)) != NULL) {
+            libxl_device_nvdimm *nvdimm =
+                ARRAY_EXTEND_INIT(d_config->nvdimms, d_config->num_nvdimms,
+                                  libxl_device_nvdimm_init);
+            char *nvdimm_cfg_str = strdup(buf);
+            char *p, *p2;
+
+            p = strtok(nvdimm_cfg_str, ",");
+            if (!p)
+                goto next_nvdimm;
+            do {
+                while (*p == ' ')
+                    p++;
+                if ((p2 = strchr(p, '=')) == NULL)
+                    break;
+                *p2 = '\0';
+                if (!strcmp(p, "file")) {
+                    nvdimm->file = strdup(p2 + 1);
+                } else if (!strcmp(p, "size")) {
+                    nvdimm->size_mb = parse_ulong(p2 + 1);
+                }
+            } while ((p = strtok(NULL, ",")) != NULL);
+
+        next_nvdimm:
+            free(nvdimm_cfg_str);
+        }
+    }
+}
+
 static void parse_config_data(const char *config_source,
                               const char *config_data,
                               int config_len,
@@ -2392,6 +2435,8 @@ skip_vfb:
         }
      }
 
+    parse_nvdimm_config(config, d_config);
+
     xlu_cfg_destroy(config);
 }
 
-- 
2.4.8


* [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
                   ` (2 preceding siblings ...)
  2015-12-29 11:31 ` [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm' Haozhong Zhang
@ 2015-12-29 11:31 ` Haozhong Zhang
  2016-01-15 17:10   ` Jan Beulich
  2016-01-06 15:37 ` [PATCH 0/4] add support for vNVDIMM Ian Campbell
  2016-01-20  3:28 ` Tian, Kevin
  5 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-29 11:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	Jan Beulich, Wei Liu

NVDIMM devices are detected and configured by software through
ACPI. Currently, QEMU maintains the ACPI tables of vNVDIMM devices. This
patch extends the existing hvmloader mechanism for loading passthrough
ACPI tables so that it also loads the extra ACPI tables built by QEMU.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 tools/firmware/hvmloader/acpi/build.c   | 34 +++++++++++++++++++++++++++------
 xen/include/public/hvm/hvm_xs_strings.h |  3 +++
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 503648c..72be3e0 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -292,8 +292,10 @@ static struct acpi_20_slit *construct_slit(void)
     return slit;
 }
 
-static int construct_passthrough_tables(unsigned long *table_ptrs,
-                                        int nr_tables)
+static int construct_passthrough_tables_common(unsigned long *table_ptrs,
+                                               int nr_tables,
+                                               const char *xs_acpi_pt_addr,
+                                               const char *xs_acpi_pt_length)
 {
     const char *s;
     uint8_t *acpi_pt_addr;
@@ -304,26 +306,28 @@ static int construct_passthrough_tables(unsigned long *table_ptrs,
     uint32_t total = 0;
     uint8_t *buffer;
 
-    s = xenstore_read(HVM_XS_ACPI_PT_ADDRESS, NULL);
+    s = xenstore_read(xs_acpi_pt_addr, NULL);
     if ( s == NULL )
-        return 0;    
+        return 0;
 
     acpi_pt_addr = (uint8_t*)(uint32_t)strtoll(s, NULL, 0);
     if ( acpi_pt_addr == NULL )
         return 0;
 
-    s = xenstore_read(HVM_XS_ACPI_PT_LENGTH, NULL);
+    s = xenstore_read(xs_acpi_pt_length, NULL);
     if ( s == NULL )
         return 0;
 
     acpi_pt_length = (uint32_t)strtoll(s, NULL, 0);
 
     for ( nr_added = 0; nr_added < nr_max; nr_added++ )
-    {        
+    {
         if ( (acpi_pt_length - total) < sizeof(struct acpi_header) )
             break;
 
         header = (struct acpi_header*)acpi_pt_addr;
+        set_checksum(header, offsetof(struct acpi_header, checksum),
+                     header->length);
 
         buffer = mem_alloc(header->length, 16);
         if ( buffer == NULL )
@@ -338,6 +342,21 @@ static int construct_passthrough_tables(unsigned long *table_ptrs,
     return nr_added;
 }
 
+static int construct_passthrough_tables(unsigned long *table_ptrs,
+                                        int nr_tables)
+{
+    return construct_passthrough_tables_common(table_ptrs, nr_tables,
+                                               HVM_XS_ACPI_PT_ADDRESS,
+                                               HVM_XS_ACPI_PT_LENGTH);
+}
+
+static int construct_dm_tables(unsigned long *table_ptrs, int nr_tables)
+{
+    return construct_passthrough_tables_common(table_ptrs, nr_tables,
+                                               HVM_XS_DM_ACPI_PT_ADDRESS,
+                                               HVM_XS_DM_ACPI_PT_LENGTH);
+}
+
 static int construct_secondary_tables(unsigned long *table_ptrs,
                                       struct acpi_info *info)
 {
@@ -454,6 +473,9 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     /* Load any additional tables passed through. */
     nr_tables += construct_passthrough_tables(table_ptrs, nr_tables);
 
+    /* Load any additional tables from device model */
+    nr_tables += construct_dm_tables(table_ptrs, nr_tables);
+
     table_ptrs[nr_tables] = 0;
     return nr_tables;
 }
diff --git a/xen/include/public/hvm/hvm_xs_strings.h b/xen/include/public/hvm/hvm_xs_strings.h
index 146b0b0..4698495 100644
--- a/xen/include/public/hvm/hvm_xs_strings.h
+++ b/xen/include/public/hvm/hvm_xs_strings.h
@@ -41,6 +41,9 @@
 #define HVM_XS_ACPI_PT_ADDRESS         "hvmloader/acpi/address"
 #define HVM_XS_ACPI_PT_LENGTH          "hvmloader/acpi/length"
 
+#define HVM_XS_DM_ACPI_PT_ADDRESS      "hvmloader/dm-acpi/address"
+#define HVM_XS_DM_ACPI_PT_LENGTH       "hvmloader/dm-acpi/length"
+
 /* Any number of SMBIOS types can be passed through to an HVM guest using
  * the following xenstore values. The values specify the guest physical
  * address and length of a block of SMBIOS structures for hvmloader to use.
-- 
2.4.8


* Re: [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb
  2015-12-29 11:31 ` [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb Haozhong Zhang
@ 2015-12-29 15:46   ` Andrew Cooper
  2015-12-30  1:35     ` Haozhong Zhang
  0 siblings, 1 reply; 88+ messages in thread
From: Andrew Cooper @ 2015-12-29 15:46 UTC (permalink / raw)
  To: Haozhong Zhang, xen-devel
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Ian Jackson, Jan Beulich, Keir Fraser

On 29/12/2015 11:31, Haozhong Zhang wrote:
> Pass CPU features CLFLUSHOPT and CLWB into HVM domain so that those two
> instructions can be used by guest.
>
> The specification of above two instructions can be found in
> https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>

Please be aware that my cpuid rework series completely changes all of
this code.  As this patch is small and self contained, it would be best
to get it accepted early and for me to rebase over the result.

As part of my cpuid work, I had come to the conclusion that CLFLUSHOPT,
CLWB and PCOMMIT were all safe for all guests to use, as they are deemed
safe for cpl3 code to use.  Is there any reason why these wouldn't be
safe for PV guests to use?

> ---
>   tools/libxc/xc_cpufeature.h      | 3 ++-
>   tools/libxc/xc_cpuid_x86.c       | 4 +++-
>   xen/arch/x86/hvm/hvm.c           | 7 +++++++
>   xen/include/asm-x86/cpufeature.h | 5 +++++
>   4 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
> index c3ddc80..5288ac6 100644
> --- a/tools/libxc/xc_cpufeature.h
> +++ b/tools/libxc/xc_cpufeature.h
> @@ -140,6 +140,7 @@
>   #define X86_FEATURE_RDSEED      18 /* RDSEED instruction */
>   #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
>   #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
> -
> +#define X86_FEATURE_CLFLUSHOPT  23 /* CLFLUSHOPT instruction */
> +#define X86_FEATURE_CLWB        24 /* CLWB instruction */
>   
>   #endif /* __LIBXC_CPUFEATURE_H */
> diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
> index 8882c01..fecfd6c 100644
> --- a/tools/libxc/xc_cpuid_x86.c
> +++ b/tools/libxc/xc_cpuid_x86.c
> @@ -426,7 +426,9 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
>                           bitmaskof(X86_FEATURE_RDSEED)  |
>                           bitmaskof(X86_FEATURE_ADX)  |
>                           bitmaskof(X86_FEATURE_SMAP) |
> -                        bitmaskof(X86_FEATURE_FSGSBASE));
> +                        bitmaskof(X86_FEATURE_FSGSBASE) |
> +                        bitmaskof(X86_FEATURE_CLWB) |
> +                        bitmaskof(X86_FEATURE_CLFLUSHOPT));
>           } else
>               regs[1] = 0;
>           regs[0] = regs[2] = regs[3] = 0;

The entry for CLFLUSHOPT in the ISA Extension manual (August 2015) talks 
about CPUID.7(ECX=1).EBX[8:15] indicating the cache line size affected 
by the instruction.  However, I can't find any other reference to this 
information, nor an extension of the CPUID instruction in the ISA 
manual.  Should the Xen cpuid handling code be updated not to clobber this?

> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 21470ec..58c83a5 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4598,6 +4598,13 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
>           /* Don't expose INVPCID to non-hap hvm. */
>           if ( (count == 0) && !hap_enabled(d) )
>               *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);
> +
> +        if ( (count == 0) && !cpu_has_clflushopt )
> +            *ebx &= ~cpufeat_mask(X86_FEATURE_CLFLUSHOPT);
> +
> +        if ( (count == 0) && !cpu_has_clwb )
> +            *ebx &= ~cpufeat_mask(X86_FEATURE_CLWB);

Please refactor this code along with the if() in the context above, so
that count is only checked once.
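
For example, roughly (illustrative sketch only, following the existing
hvm_cpuid() style and the macros shown in the diff):

    if ( count == 0 )
    {
        /* Don't expose INVPCID to non-hap hvm. */
        if ( !hap_enabled(d) )
            *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);

        if ( !cpu_has_clflushopt )
            *ebx &= ~cpufeat_mask(X86_FEATURE_CLFLUSHOPT);

        if ( !cpu_has_clwb )
            *ebx &= ~cpufeat_mask(X86_FEATURE_CLWB);
    }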

~Andrew


* Re: [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb
  2015-12-29 15:46   ` Andrew Cooper
@ 2015-12-30  1:35     ` Haozhong Zhang
  2015-12-30  2:16       ` Haozhong Zhang
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-30  1:35 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kevin Tian, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Ian Jackson, xen-devel, Jan Beulich, Wei Liu

On 12/29/15 15:46, Andrew Cooper wrote:
> On 29/12/2015 11:31, Haozhong Zhang wrote:
> >Pass CPU features CLFLUSHOPT and CLWB into HVM domain so that those two
> >instructions can be used by guest.
> >
> >The specification of above two instructions can be found in
> >https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
> >
> >Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> 
> Please be aware that my cpuid rework series completely changes all of this
> code.   As this patch is small and self contained, it would be best to get
> it accepted early and for me to rebase over the result.
>

I'll split this patch series into two parts and put these two
instruction enabling patches in the first part.

> As part of my cpuid work, I had come to the conclusion that CLFLUSHOPT, CLWB
> and PCOMMIT were all safe for all guests to use, as they deemed safe for
> cpl3 code to use.  Is there any reason why these wouldn't be safe for PV
> guests to use?
>

Not because of a safety concern. These three instructions are usually
used with NVDIMM, which is only implemented for HVM domains in this
patch series, so I didn't enable them for PV. I think they can be
enabled for PV later by another patch.

> >---
> >  tools/libxc/xc_cpufeature.h      | 3 ++-
> >  tools/libxc/xc_cpuid_x86.c       | 4 +++-
> >  xen/arch/x86/hvm/hvm.c           | 7 +++++++
> >  xen/include/asm-x86/cpufeature.h | 5 +++++
> >  4 files changed, 17 insertions(+), 2 deletions(-)
> >
> >diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
> >index c3ddc80..5288ac6 100644
> >--- a/tools/libxc/xc_cpufeature.h
> >+++ b/tools/libxc/xc_cpufeature.h
> >@@ -140,6 +140,7 @@
> >  #define X86_FEATURE_RDSEED      18 /* RDSEED instruction */
> >  #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
> >  #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
> >-
> >+#define X86_FEATURE_CLFLUSHOPT  23 /* CLFLUSHOPT instruction */
> >+#define X86_FEATURE_CLWB        24 /* CLWB instruction */
> >  #endif /* __LIBXC_CPUFEATURE_H */
> >diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
> >index 8882c01..fecfd6c 100644
> >--- a/tools/libxc/xc_cpuid_x86.c
> >+++ b/tools/libxc/xc_cpuid_x86.c
> >@@ -426,7 +426,9 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
> >                          bitmaskof(X86_FEATURE_RDSEED)  |
> >                          bitmaskof(X86_FEATURE_ADX)  |
> >                          bitmaskof(X86_FEATURE_SMAP) |
> >-                        bitmaskof(X86_FEATURE_FSGSBASE));
> >+                        bitmaskof(X86_FEATURE_FSGSBASE) |
> >+                        bitmaskof(X86_FEATURE_CLWB) |
> >+                        bitmaskof(X86_FEATURE_CLFLUSHOPT));
> >          } else
> >              regs[1] = 0;
> >          regs[0] = regs[2] = regs[3] = 0;
> 
> The entry for CLFLUSHOPT in the ISA Extension manual (August 2015) talks
> about CPUID.7(ECX=1).EBX[8:15] indicating the cache line size affected by
> the instruction.  However, I can't find any other reference to this
> information, nor an extension of the CPUID instruction in the ISA manual.
> Should the Xen cpuid handling code be updated not to clobber this?
>

Yes, I missed this part and will update in the next version.

> >diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> >index 21470ec..58c83a5 100644
> >--- a/xen/arch/x86/hvm/hvm.c
> >+++ b/xen/arch/x86/hvm/hvm.c
> >@@ -4598,6 +4598,13 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
> >          /* Don't expose INVPCID to non-hap hvm. */
> >          if ( (count == 0) && !hap_enabled(d) )
> >              *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);
> >+
> >+        if ( (count == 0) && !cpu_has_clflushopt )
> >+            *ebx &= ~cpufeat_mask(X86_FEATURE_CLFLUSHOPT);
> >+
> >+        if ( (count == 0) && !cpu_has_clwb )
> >+            *ebx &= ~cpufeat_mask(X86_FEATURE_CLWB);
> 
> Please refactor this code along with if() in context above, to only check
> count once.
>

Yes, I'll update in the next version.

Thanks,
Haozhong


* Re: [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb
  2015-12-30  1:35     ` Haozhong Zhang
@ 2015-12-30  2:16       ` Haozhong Zhang
  2015-12-30 10:33         ` Andrew Cooper
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2015-12-30  2:16 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel, Keir Fraser, Jan Beulich, Ian Jackson,
	Stefano Stabellini, Ian Campbell, Wei Liu, Jun Nakajima,
	Kevin Tian

On 12/30/15 09:35, Haozhong Zhang wrote:
> On 12/29/15 15:46, Andrew Cooper wrote:
> > On 29/12/2015 11:31, Haozhong Zhang wrote:
> > >Pass CPU features CLFLUSHOPT and CLWB into HVM domain so that those two
> > >instructions can be used by guest.
> > >
> > >The specification of above two instructions can be found in
> > >https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
> > >
> > >Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > 
> > Please be aware that my cpuid rework series completely changes all of this
> > code.   As this patch is small and self contained, it would be best to get
> > it accepted early and for me to rebase over the result.
> >
> 
> I'll split this patch series into two parts and put these two
> instruction enabling patches in the first part.
> 
> > As part of my cpuid work, I had come to the conclusion that CLFLUSHOPT, CLWB
> > and PCOMMIT were all safe for all guests to use, as they deemed safe for
> > cpl3 code to use.  Is there any reason why these wouldn't be safe for PV
> > guests to use?
> >
> 
> Not for safety concern. These three instructions are usually used with
> NVDIMM which are only implemented for HVM domains in this patch
> series, so I didn't enable them for PV. I think they can be enabled
> for PV later by another patch.
> 
> > >---
> > >  tools/libxc/xc_cpufeature.h      | 3 ++-
> > >  tools/libxc/xc_cpuid_x86.c       | 4 +++-
> > >  xen/arch/x86/hvm/hvm.c           | 7 +++++++
> > >  xen/include/asm-x86/cpufeature.h | 5 +++++
> > >  4 files changed, 17 insertions(+), 2 deletions(-)
> > >
> > >diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
> > >index c3ddc80..5288ac6 100644
> > >--- a/tools/libxc/xc_cpufeature.h
> > >+++ b/tools/libxc/xc_cpufeature.h
> > >@@ -140,6 +140,7 @@
> > >  #define X86_FEATURE_RDSEED      18 /* RDSEED instruction */
> > >  #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
> > >  #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
> > >-
> > >+#define X86_FEATURE_CLFLUSHOPT  23 /* CLFLUSHOPT instruction */
> > >+#define X86_FEATURE_CLWB        24 /* CLWB instruction */
> > >  #endif /* __LIBXC_CPUFEATURE_H */
> > >diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
> > >index 8882c01..fecfd6c 100644
> > >--- a/tools/libxc/xc_cpuid_x86.c
> > >+++ b/tools/libxc/xc_cpuid_x86.c
> > >@@ -426,7 +426,9 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
> > >                          bitmaskof(X86_FEATURE_RDSEED)  |
> > >                          bitmaskof(X86_FEATURE_ADX)  |
> > >                          bitmaskof(X86_FEATURE_SMAP) |
> > >-                        bitmaskof(X86_FEATURE_FSGSBASE));
> > >+                        bitmaskof(X86_FEATURE_FSGSBASE) |
> > >+                        bitmaskof(X86_FEATURE_CLWB) |
> > >+                        bitmaskof(X86_FEATURE_CLFLUSHOPT));
> > >          } else
> > >              regs[1] = 0;
> > >          regs[0] = regs[2] = regs[3] = 0;
> > 
> > The entry for CLFLUSHOPT in the ISA Extension manual (August 2015) talks
> > about CPUID.7(ECX=1).EBX[8:15] indicating the cache line size affected by
> > the instruction. However, I can't find any other reference to this
> > information, nor an extension of the CPUID instruction in the ISA manual.
> > Should the Xen cpuid handling code be updated not to clobber this?
> >
> 
> Yes, I missed this part and will update in the next version.
>

I double-checked the manual and it says that

 "The aligned cache line size affected is also indicated with the
  CPUID instruction (bits 8 through 15 of the EBX register when the
  initial value in the EAX register is 1)"

so I guess you really meant CPUID.1.EBX[8:15]. The 0x00000001 case
branch in xc_cpuid_hvm_policy() (and its callers) has already passed
the host CPUID.1.EBX[8:15] to HVM domains, so no more action is needed
in this patch.
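
For completeness, that field can be read as below (sketch; CPUID.1:EBX
bits 15:8 give the CLFLUSH line size in 8-byte units):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if ( __get_cpuid(1, &eax, &ebx, &ecx, &edx) )
            printf("CLFLUSH line size: %u bytes\n",
                   ((ebx >> 8) & 0xff) * 8);
        return 0;
    }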

Haozhong


* Re: [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb
  2015-12-30  2:16       ` Haozhong Zhang
@ 2015-12-30 10:33         ` Andrew Cooper
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Cooper @ 2015-12-30 10:33 UTC (permalink / raw)
  To: xen-devel, Keir Fraser, Jan Beulich, Ian Jackson,
	Stefano Stabellini, Ian Campbell, Wei Liu, Jun Nakajima,
	Kevin Tian

On 30/12/2015 02:16, Haozhong Zhang wrote:
> On 12/30/15 09:35, Haozhong Zhang wrote:
>> On 12/29/15 15:46, Andrew Cooper wrote:
>>> On 29/12/2015 11:31, Haozhong Zhang wrote:
>>>> Pass CPU features CLFLUSHOPT and CLWB into HVM domain so that those two
>>>> instructions can be used by guest.
>>>>
>>>> The specification of above two instructions can be found in
>>>> https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
>>>>
>>>> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
>>> Please be aware that my cpuid rework series completely changes all of this
>>> code.   As this patch is small and self contained, it would be best to get
>>> it accepted early and for me to rebase over the result.
>>>
>> I'll split this patch series into two parts and put these two
>> instruction enabling patches in the first part.
>>
>>> As part of my cpuid work, I had come to the conclusion that CLFLUSHOPT, CLWB
>>> and PCOMMIT were all safe for all guests to use, as they deemed safe for
>>> cpl3 code to use.  Is there any reason why these wouldn't be safe for PV
>>> guests to use?
>>>
>> Not for safety concern. These three instructions are usually used with
>> NVDIMM which are only implemented for HVM domains in this patch
>> series, so I didn't enable them for PV. I think they can be enabled
>> for PV later by another patch.
>>
>>>> ---
>>>>   tools/libxc/xc_cpufeature.h      | 3 ++-
>>>>   tools/libxc/xc_cpuid_x86.c       | 4 +++-
>>>>   xen/arch/x86/hvm/hvm.c           | 7 +++++++
>>>>   xen/include/asm-x86/cpufeature.h | 5 +++++
>>>>   4 files changed, 17 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/tools/libxc/xc_cpufeature.h b/tools/libxc/xc_cpufeature.h
>>>> index c3ddc80..5288ac6 100644
>>>> --- a/tools/libxc/xc_cpufeature.h
>>>> +++ b/tools/libxc/xc_cpufeature.h
>>>> @@ -140,6 +140,7 @@
>>>>   #define X86_FEATURE_RDSEED      18 /* RDSEED instruction */
>>>>   #define X86_FEATURE_ADX         19 /* ADCX, ADOX instructions */
>>>>   #define X86_FEATURE_SMAP        20 /* Supervisor Mode Access Protection */
>>>> -
>>>> +#define X86_FEATURE_CLFLUSHOPT  23 /* CLFLUSHOPT instruction */
>>>> +#define X86_FEATURE_CLWB        24 /* CLWB instruction */
>>>>   #endif /* __LIBXC_CPUFEATURE_H */
>>>> diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
>>>> index 8882c01..fecfd6c 100644
>>>> --- a/tools/libxc/xc_cpuid_x86.c
>>>> +++ b/tools/libxc/xc_cpuid_x86.c
>>>> @@ -426,7 +426,9 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
>>>>                           bitmaskof(X86_FEATURE_RDSEED)  |
>>>>                           bitmaskof(X86_FEATURE_ADX)  |
>>>>                           bitmaskof(X86_FEATURE_SMAP) |
>>>> -                        bitmaskof(X86_FEATURE_FSGSBASE));
>>>> +                        bitmaskof(X86_FEATURE_FSGSBASE) |
>>>> +                        bitmaskof(X86_FEATURE_CLWB) |
>>>> +                        bitmaskof(X86_FEATURE_CLFLUSHOPT));
>>>>           } else
>>>>               regs[1] = 0;
>>>>           regs[0] = regs[2] = regs[3] = 0;
>>> The entry for CLFLUSHOPT in the ISA Extension manual (August 2015) talks
>>> about CPUID.7(ECX=1).EBX[8:15] indicating the cache line size affected by
>>> the instruction. However, I can't find any other reference to this
>>> information, nor an extension of the CPUID instruction in the ISA manual.
>>> Should the Xen cpuid handling code be updated not to clobber this?
>>>
>> Yes, I missed this part and will update in the next version.
>>
> I double-checked the manual and it says that
>
>   "The aligned cache line size affected is also indicated with the
>    CPUID instruction (bits 8 through 15 of the EBX register when the
>    initial value in the EAX register is 1)"
>
> so I guess you really meant CPUID.1.EBX[8:15]. The 0x00000001 case
> branch in xc_cpuid_hvm_policy() (and its callers) has already passed
> the host CPUID.1.EBX[8:15] to HVM domains, so no more action is needed
> in this patch.

Oops sorry.  Yes - I misread the paragraph in the manual.

Apologies for the noise.

~Andrew


* Re: [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm'
  2015-12-29 11:31 ` [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm' Haozhong Zhang
@ 2016-01-04 11:16   ` Wei Liu
  2016-01-06 12:40   ` Jan Beulich
  1 sibling, 0 replies; 88+ messages in thread
From: Wei Liu @ 2016-01-04 11:16 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Wei Liu

On Tue, Dec 29, 2015 at 07:31:50PM +0800, Haozhong Zhang wrote:
> This configure is used to specify vNVDIMM devices which are provided to
> the guest. xl parses this configuration and passes the result to qemu
> that is responsible to create vNVDIMM devices.
> 
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>

For the record, in your latest series you said you would be sending the
toolstack changes in a separate patch set, so I'll skip these two
patches for now.

Wei.


* Re: [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm'
  2015-12-29 11:31 ` [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm' Haozhong Zhang
  2016-01-04 11:16   ` Wei Liu
@ 2016-01-06 12:40   ` Jan Beulich
  2016-01-06 15:28     ` Haozhong Zhang
  1 sibling, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-06 12:40 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

>>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -962,6 +962,25 @@ FIFO-based event channel ABI support up to 131,071 event channels.
>  Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
>  x86).
>  
> +=item B<nvdimm=[ "NVDIMM_SPEC_STRING", "NVDIMM_SPEC_STRING", ... ]>
> +
> +Specifies the NVDIMM devices which are provided to the guest.
> +
> +Each B<NVDIMM_SPEC_STRING> is a comma-separated list of C<KEY=VALUE>
> +settings, from the following list:
> +
> +=over 4
> +
> +=item C<file=PATH_TO_NVDIMM_DEVICE_FILE>
> +
> +Specifies the path to the file of the NVDIMM device, e.g. file=/dev/pmem0.
> +
> +=item C<size=MBYTES>
> +
> +Specifies the size in Mbytes of the NVDIMM device.

This looks odd: Either the entire file is meant to be passed (in
which case the size should be derivable) or you need an
(offset,size) pair here.

Jan


* Re: [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm'
  2016-01-06 12:40   ` Jan Beulich
@ 2016-01-06 15:28     ` Haozhong Zhang
  0 siblings, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-06 15:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

On 01/06/16 05:40, Jan Beulich wrote:
> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> > --- a/docs/man/xl.cfg.pod.5
> > +++ b/docs/man/xl.cfg.pod.5
> > @@ -962,6 +962,25 @@ FIFO-based event channel ABI support up to 131,071 event channels.
> >  Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
> >  x86).
> >  
> > +=item B<nvdimm=[ "NVDIMM_SPEC_STRING", "NVDIMM_SPEC_STRING", ... ]>
> > +
> > +Specifies the NVDIMM devices which are provided to the guest.
> > +
> > +Each B<NVDIMM_SPEC_STRING> is a comma-separated list of C<KEY=VALUE>
> > +settings, from the following list:
> > +
> > +=over 4
> > +
> > +=item C<file=PATH_TO_NVDIMM_DEVICE_FILE>
> > +
> > +Specifies the path to the file of the NVDIMM device, e.g. file=/dev/pmem0.
> > +
> > +=item C<size=MBYTES>
> > +
> > +Specifies the size in Mbytes of the NVDIMM device.
> 
> This looks odd: Either the entire file is meant to be passed (in
> which case the size should be derivable) or you need an
> (offset,size) pair here.
>

The intent is to pass the entire file. I'll remove the 'size' option and
derive the size either in the toolstack or on the QEMU side.
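
Deriving it could look roughly like this (sketch only; regular files
report their size via st_size, while block devices such as /dev/pmem0
need the BLKGETSIZE64 ioctl; error handling omitted):

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static uint64_t pmem_backend_size(const char *path)
    {
        struct stat st;
        uint64_t size = 0;
        int fd = open(path, O_RDONLY);

        fstat(fd, &st);
        if ( S_ISBLK(st.st_mode) )
            ioctl(fd, BLKGETSIZE64, &size);
        else
            size = st.st_size;
        close(fd);
        return size;
    }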

Haozhong


* Re: [PATCH 0/4] add support for vNVDIMM
  2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
                   ` (3 preceding siblings ...)
  2015-12-29 11:31 ` [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu Haozhong Zhang
@ 2016-01-06 15:37 ` Ian Campbell
  2016-01-06 15:47   ` Haozhong Zhang
  2016-01-20  3:28 ` Tian, Kevin
  5 siblings, 1 reply; 88+ messages in thread
From: Ian Campbell @ 2016-01-06 15:37 UTC (permalink / raw)
  To: Haozhong Zhang, xen-devel
  Cc: Kevin Tian, Wei Liu, Jun Nakajima, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, Jan Beulich, Keir Fraser

On Tue, 2015-12-29 at 19:31 +0800, Haozhong Zhang wrote:
> This patch series is the Xen part patch to provide virtual NVDIMM to
> guest. The corresponding QEMU patch series is sent separately with the
> title "[PATCH 0/2] add vNVDIMM support for Xen".

When you send multiple related series like this please could you tag them
in the 0/N subject line somehow as to the tree they are for. Either tagging
with "[PATCH XEN 0/4]" (via git send-email --subject-prefix="PATCH XEN") or
using something like "xen: add support for ..." (and the equivalent for
other trees).

In this case I incorrectly categorised this based on the subject as a
repost of a QEMU series I had seen just before and hence ignored it. I
spotted a bit of diffstat in a reply and have now put it into my queue to
look at, but that was pure luck.

Ian.


* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-06 15:37 ` [PATCH 0/4] add support for vNVDIMM Ian Campbell
@ 2016-01-06 15:47   ` Haozhong Zhang
  0 siblings, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-06 15:47 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, Keir Fraser, Jun Nakajima, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich, Wei Liu

On 01/06/16 15:37, Ian Campbell wrote:
> On Tue, 2015-12-29 at 19:31 +0800, Haozhong Zhang wrote:
> > This patch series is the Xen part patch to provide virtual NVDIMM to
> > guest. The corresponding QEMU patch series is sent separately with the
> > title "[PATCH 0/2] add vNVDIMM support for Xen".
> 
> When you send multiple related series like this please could you tag them
> in the 0/N subject line somehow as to the tree they are for. Either tagging
> with "[PATCH XEN 0/4]" (via git send-email --subject-prefix="PATCH XEN") or
> using something like "xen: add support for ..." (and the equivalent for
> other trees).
> 
> In this case I incorrectly categorised this based on the subject as a
> repost of a QEMU series I had seen just before and hence ignored it. I
> spotted a bit of diffstat in a reply and have now put it into my queue to
> look at, but that was pure luck.
> 
> Ian.

Sorry for the trouble. I'll add tags in new versions.

Haozhong


* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2015-12-29 11:31 ` [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu Haozhong Zhang
@ 2016-01-15 17:10   ` Jan Beulich
  2016-01-18  0:52     ` Haozhong Zhang
  0 siblings, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-15 17:10 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

>>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> NVDIMM devices are detected and configured by software through
> ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> patch extends the existing mechanism in hvmloader of loading passthrough
> ACPI tables to load extra ACPI tables built by QEMU.

Mechanically the patch looks okay, but whether it's actually needed
depends on whether indeed we want NV RAM managed in qemu instead of in
the hypervisor (where imo it belongs); I didn't see any reply yet to
that same comment of mine made (iirc) in the context of another patch.

Jan


* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-15 17:10   ` Jan Beulich
@ 2016-01-18  0:52     ` Haozhong Zhang
  2016-01-18  8:46       ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-18  0:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

On 01/15/16 10:10, Jan Beulich wrote:
> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> > NVDIMM devices are detected and configured by software through
> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> > patch extends the existing mechanism in hvmloader of loading passthrough
> > ACPI tables to load extra ACPI tables built by QEMU.
> 
> Mechanically the patch looks okay, but whether it's actually needed
> depends on whether indeed we want NV RAM managed in qemu
> instead of in the hypervisor (where imo it belongs); I didn' see any
> reply yet to that same comment of mine made (iirc) in the context
> of another patch.
> 
> Jan
> 

One purpose of this patch series is to provide vNVDIMM backed by host
NVDIMM devices. Detecting and managing host NVDIMM devices (including
parsing ACPI, managing labels, etc.) requires non-trivial drivers, so I
leave this work to the dom0 Linux kernel. The current Linux kernel
abstracts NVDIMM devices as block devices (/dev/pmemXX). QEMU then
mmaps them into a certain range of dom0's address space and asks the
Xen hypervisor to map that range of address space to a domU.
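
For illustration only, a minimal sketch (plain C, assuming a /dev/pmem0
device node and an arbitrary 1 GB region) of the dom0-side step, i.e.
how a process such as QEMU obtains a virtual mapping of the pmem block
device before asking Xen to expose it to a guest:

    /* Sketch only: the device path and size are assumptions, and error
     * handling is minimal. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dev = "/dev/pmem0";   /* assumed pmem device node */
        size_t len = 1UL << 30;           /* assume a 1 GB region */

        int fd = open(dev, O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* MAP_SHARED so that stores reach the persistent medium once
         * flushed (clflushopt/clwb + pcommit). */
        void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (va == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        /* At this point the range [va, va + len) would be handed to Xen
         * so it can be mapped into the guest physical address space. */
        printf("mapped %s at %p\n", dev, va);

        munmap(va, len);
        close(fd);
        return 0;
    }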

However, there are two problems in this Xen patch series and the
corresponding QEMU patch series, which may require further
changes in the hypervisor and/or the toolstack.

(1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
    the host NVDIMM to a domU, which results in a VM exit for every
    guest read/write to the corresponding vNVDIMM device. I'm going to
    find a way to pass the address space range of the host NVDIMM
    through to a guest domU (similar to what xen-pt in QEMU does).

(2) Xen currently does not check whether the address that QEMU asks to
    map to a domU is really within the host NVDIMM address space.
    Therefore, the Xen hypervisor needs a way to determine the host
    NVDIMM address space, which could be done by parsing the ACPI NFIT
    tables.

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-18  0:52     ` Haozhong Zhang
@ 2016-01-18  8:46       ` Jan Beulich
  2016-01-19 11:37         ` Wei Liu
  2016-01-20  5:31         ` Haozhong Zhang
  0 siblings, 2 replies; 88+ messages in thread
From: Jan Beulich @ 2016-01-18  8:46 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

>>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> On 01/15/16 10:10, Jan Beulich wrote:
>> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
>> > NVDIMM devices are detected and configured by software through
>> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
>> > patch extends the existing mechanism in hvmloader of loading passthrough
>> > ACPI tables to load extra ACPI tables built by QEMU.
>> 
>> Mechanically the patch looks okay, but whether it's actually needed
>> depends on whether indeed we want NV RAM managed in qemu
>> instead of in the hypervisor (where imo it belongs); I didn' see any
>> reply yet to that same comment of mine made (iirc) in the context
>> of another patch.
> 
> One purpose of this patch series is to provide vNVDIMM backed by host
> NVDIMM devices. It requires some drivers to detect and manage host
> NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> are not trivial, so I leave this work to the dom0 linux. Current Linux
> kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> then mmaps them into certain range of dom0's address space and asks
> Xen hypervisor to map that range of address space to a domU.
> 
> However, there are two problems in this Xen patch series and the
> corresponding QEMU patch series, which may require further
> changes in hypervisor and/or toolstack.
> 
> (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
>     the host NVDIMM to domU, which results VMEXIT for every guest
>     read/write to the corresponding vNVDIMM devices. I'm going to find
>     a way to passthrough the address space range of host NVDIMM to a
>     guest domU (similarly to what xen-pt in QEMU uses)
>     
> (2) Xen currently does not check whether the address that QEMU asks to
>     map to domU is really within the host NVDIMM address
>     space. Therefore, Xen hypervisor needs a way to decide the host
>     NVDIMM address space which can be done by parsing ACPI NFIT
>     tables.

These problems are a pretty direct result of the management of
NVDIMM not being done by the hypervisor.

Stating what qemu currently does is, I'm afraid, not really serving
the purpose of hashing out whether the management of NVDIMM,
just like that of "normal" RAM, wouldn't better be done by the
hypervisor. In fact so far I haven't seen any rationale (other than
the desire to share code with KVM) for the presently chosen
solution. Yet in KVM qemu is - afaict - much more of an integral part
of the hypervisor than it is in the Xen case (and even there core
management of the memory is left to the kernel, i.e. what
constitutes the core hypervisor there).

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-18  8:46       ` Jan Beulich
@ 2016-01-19 11:37         ` Wei Liu
  2016-01-19 11:46           ` Jan Beulich
  2016-01-20  5:31         ` Haozhong Zhang
  1 sibling, 1 reply; 88+ messages in thread
From: Wei Liu @ 2016-01-19 11:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser

On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> > On 01/15/16 10:10, Jan Beulich wrote:
> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> >> > NVDIMM devices are detected and configured by software through
> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> >> > ACPI tables to load extra ACPI tables built by QEMU.
> >> 
> >> Mechanically the patch looks okay, but whether it's actually needed
> >> depends on whether indeed we want NV RAM managed in qemu
> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> >> reply yet to that same comment of mine made (iirc) in the context
> >> of another patch.
> > 
> > One purpose of this patch series is to provide vNVDIMM backed by host
> > NVDIMM devices. It requires some drivers to detect and manage host
> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > then mmaps them into certain range of dom0's address space and asks
> > Xen hypervisor to map that range of address space to a domU.
> > 

OOI, do we have a viable solution for doing all these non-trivial
things in the core hypervisor? Are you proposing to design a new set of
hypercalls for NVDIMM?

Wei.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-19 11:37         ` Wei Liu
@ 2016-01-19 11:46           ` Jan Beulich
  2016-01-20  5:14             ` Tian, Kevin
  0 siblings, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-19 11:46 UTC (permalink / raw)
  To: Wei Liu
  Cc: Haozhong Zhang, Kevin Tian, Keir Fraser, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima

>>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote:
> On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
>> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
>> > On 01/15/16 10:10, Jan Beulich wrote:
>> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
>> >> > NVDIMM devices are detected and configured by software through
>> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
>> >> > patch extends the existing mechanism in hvmloader of loading passthrough
>> >> > ACPI tables to load extra ACPI tables built by QEMU.
>> >> 
>> >> Mechanically the patch looks okay, but whether it's actually needed
>> >> depends on whether indeed we want NV RAM managed in qemu
>> >> instead of in the hypervisor (where imo it belongs); I didn' see any
>> >> reply yet to that same comment of mine made (iirc) in the context
>> >> of another patch.
>> > 
>> > One purpose of this patch series is to provide vNVDIMM backed by host
>> > NVDIMM devices. It requires some drivers to detect and manage host
>> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
>> > are not trivial, so I leave this work to the dom0 linux. Current Linux
>> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
>> > then mmaps them into certain range of dom0's address space and asks
>> > Xen hypervisor to map that range of address space to a domU.
>> > 
> 
> OOI Do we have a viable solution to do all these non-trivial things in
> core hypervisor?  Are you proposing designing a new set of hypercalls
> for NVDIMM?  

That's certainly a possibility; I lack sufficient detail to form an
opinion on which route is going to be best.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
                   ` (4 preceding siblings ...)
  2016-01-06 15:37 ` [PATCH 0/4] add support for vNVDIMM Ian Campbell
@ 2016-01-20  3:28 ` Tian, Kevin
  2016-01-20 12:43   ` Stefano Stabellini
  5 siblings, 1 reply; 88+ messages in thread
From: Tian, Kevin @ 2016-01-20  3:28 UTC (permalink / raw)
  To: Zhang, Haozhong, xen-devel
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Nakajima, Jun,
	Andrew Cooper, Ian Jackson, Jan Beulich, Wei Liu

> From: Zhang, Haozhong
> Sent: Tuesday, December 29, 2015 7:32 PM
> 
> This patch series is the Xen part patch to provide virtual NVDIMM to
> guest. The corresponding QEMU patch series is sent separately with the
> title "[PATCH 0/2] add vNVDIMM support for Xen".
> 
> * Background
> 
>  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
>  supported on Intel's platform. NVDIMM devices are discovered via ACPI
>  and configured by _DSM method of NVDIMM device in ACPI. Some
>  documents can be found at
>  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
>  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
>  [3] DSM Interface Example:
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
>  [4] Driver Writer's Guide:
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> 
>  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
>  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
>  mapped into CPU's address space and are accessed via normal memory
>  read/write and three special instructions (clflushopt/clwb/pcommit).
> 
>  This patch series and the corresponding QEMU patch series enable Xen
>  to provide vNVDIMM devices to HVM domains.
> 
> * Design
> 
>  Supporting vNVDIMM in PMEM mode has three requirements.
> 

Although this design is about vNVDIMM, some background on how pNVDIMM
is managed in Xen would be helpful for understanding the whole design,
since in PMEM mode you need to map pNVDIMM into the GFN address space,
so there is the question of how pNVDIMM is allocated.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-19 11:46           ` Jan Beulich
@ 2016-01-20  5:14             ` Tian, Kevin
  2016-01-20  5:58               ` Zhang, Haozhong
  0 siblings, 1 reply; 88+ messages in thread
From: Tian, Kevin @ 2016-01-20  5:14 UTC (permalink / raw)
  To: Jan Beulich, Wei Liu
  Cc: Zhang, Haozhong, Keir Fraser, Ian Campbell, StefanoStabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Nakajima, Jun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 19, 2016 7:47 PM
> 
> >>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote:
> > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> >> > On 01/15/16 10:10, Jan Beulich wrote:
> >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> >> >> > NVDIMM devices are detected and configured by software through
> >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> >> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> >> >> > ACPI tables to load extra ACPI tables built by QEMU.
> >> >>
> >> >> Mechanically the patch looks okay, but whether it's actually needed
> >> >> depends on whether indeed we want NV RAM managed in qemu
> >> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> >> >> reply yet to that same comment of mine made (iirc) in the context
> >> >> of another patch.
> >> >
> >> > One purpose of this patch series is to provide vNVDIMM backed by host
> >> > NVDIMM devices. It requires some drivers to detect and manage host
> >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> >> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> >> > then mmaps them into certain range of dom0's address space and asks
> >> > Xen hypervisor to map that range of address space to a domU.
> >> >
> >
> > OOI Do we have a viable solution to do all these non-trivial things in
> > core hypervisor?  Are you proposing designing a new set of hypercalls
> > for NVDIMM?
> 
> That's certainly a possibility; I lack sufficient detail to make myself
> an opinion which route is going to be best.
> 
> Jan

Hi, Haozhong,

Are the NVDIMM-related ACPI tables in plain (static) format, or do they
require an ACPI parser to decode? Is there a corresponding E820 entry?

Above information would be useful to help decide the direction.

At a glance I like Jan's idea that it's better to let Xen manage NVDIMM,
since it's a type of memory resource and we expect the hypervisor to
centrally manage memory.

On the other hand, the answer is different if we view this resource as
an MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc.; then it
should be fine to have Dom0 manage NVDIMM while Xen just controls the
mapping based on the existing I/O permission mechanism.

Another possible point for this model is that PMEM is only one mode of
an NVDIMM device, which can also be exposed as a storage device. In the
latter case the management has to be in Dom0, so we don't need to
scatter the management role across Dom0/Xen based on different modes.

Back to your earlier questions:

> (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
>     the host NVDIMM to domU, which results VMEXIT for every guest
>     read/write to the corresponding vNVDIMM devices. I'm going to find
>     a way to passthrough the address space range of host NVDIMM to a
>     guest domU (similarly to what xen-pt in QEMU uses)
> 
> (2) Xen currently does not check whether the address that QEMU asks to
>     map to domU is really within the host NVDIMM address
>     space. Therefore, Xen hypervisor needs a way to decide the host
>     NVDIMM address space which can be done by parsing ACPI NFIT
>     tables.

If you look at how ACPI OpRegion is handled for IGD passthrough:

 ret = xc_domain_iomem_permission(xen_xc, xen_domid,
         (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
         XEN_PCI_INTEL_OPREGION_PAGES,
         XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED);

 ret = xc_domain_memory_mapping(xen_xc, xen_domid,
         (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT),
         (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
         XEN_PCI_INTEL_OPREGION_PAGES,
         DPCI_ADD_MAPPING);

The above can address your two questions. Xen doesn't need to tell
exactly whether the assigned range actually belongs to the NVDIMM,
just like the policy for PCI assignment today.
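
For illustration, a minimal sketch of the same pattern applied to a
pmem range (the helper and the mfn/gfn/page-count parameters below are
hypothetical; in practice they would come from the dom0 driver and the
toolstack):

    #include <xenctrl.h>

    /* Hypothetical helper: grant the domain access to the host pmem
     * frames, then map them at the chosen guest frame numbers. */
    static int map_pmem_to_guest(xc_interface *xch, uint32_t domid,
                                 unsigned long host_mfn,
                                 unsigned long guest_gfn,
                                 unsigned long nr_pages)
    {
        int rc = xc_domain_iomem_permission(xch, domid, host_mfn,
                                            nr_pages, 1 /* allow */);
        if (rc)
            return rc;
        return xc_domain_memory_mapping(xch, domid, guest_gfn, host_mfn,
                                        nr_pages, DPCI_ADD_MAPPING);
    }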

Thanks
Kevin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-18  8:46       ` Jan Beulich
  2016-01-19 11:37         ` Wei Liu
@ 2016-01-20  5:31         ` Haozhong Zhang
  2016-01-20  8:46           ` Jan Beulich
  1 sibling, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20  5:31 UTC (permalink / raw)
  To: Jan Beulich, Wei Liu, Kevin Tian
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Ian Jackson, xen-devel, Jun Nakajima

Hi Jan, Wei and Kevin,

On 01/18/16 01:46, Jan Beulich wrote:
> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> > On 01/15/16 10:10, Jan Beulich wrote:
> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> >> > NVDIMM devices are detected and configured by software through
> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> >> > ACPI tables to load extra ACPI tables built by QEMU.
> >> 
> >> Mechanically the patch looks okay, but whether it's actually needed
> >> depends on whether indeed we want NV RAM managed in qemu
> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> >> reply yet to that same comment of mine made (iirc) in the context
> >> of another patch.
> > 
> > One purpose of this patch series is to provide vNVDIMM backed by host
> > NVDIMM devices. It requires some drivers to detect and manage host
> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > then mmaps them into certain range of dom0's address space and asks
> > Xen hypervisor to map that range of address space to a domU.
> > 
> > However, there are two problems in this Xen patch series and the
> > corresponding QEMU patch series, which may require further
> > changes in hypervisor and/or toolstack.
> > 
> > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
> >     the host NVDIMM to domU, which results VMEXIT for every guest
> >     read/write to the corresponding vNVDIMM devices. I'm going to find
> >     a way to passthrough the address space range of host NVDIMM to a
> >     guest domU (similarly to what xen-pt in QEMU uses)
> >     
> > (2) Xen currently does not check whether the address that QEMU asks to
> >     map to domU is really within the host NVDIMM address
> >     space. Therefore, Xen hypervisor needs a way to decide the host
> >     NVDIMM address space which can be done by parsing ACPI NFIT
> >     tables.
> 
> These problems are a pretty direct result of the management of
> NVDIMM not being done by the hypervisor.
> 
> Stating what qemu currently does is, I'm afraid, not really serving
> the purpose of hashing out whether the management of NVDIMM,
> just like that of "normal" RAM, wouldn't better be done by the
> hypervisor. In fact so far I haven't seen any rationale (other than
> the desire to share code with KVM) for the presently chosen
> solution. Yet in KVM qemu is - afaict - much more of an integral part
> of the hypervisor than it is in the Xen case (and even there core
> management of the memory is left to the kernel, i.e. what
> constitutes the core hypervisor there).
> 
> Jan
> 

Sorry for the late reply; I was reading some code and trying to get
things clear for myself.

The primary reason for the current solution is to reuse the existing
NVDIMM driver in the Linux kernel.

One responsibility of this driver is to discover NVDIMM devices and
their parameters (e.g. which portion of an NVDIMM device can be mapped
into the system address space and which address it is mapped to) by
parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of the
ACPI Specification v6 and the actual code in the Linux kernel
(drivers/acpi/nfit.*), this is not a trivial task.

Secondly, the driver implements a convenient block device interface
that lets software access the areas where NVDIMM devices are mapped.
The existing vNVDIMM implementation in QEMU uses this interface.

As the Linux NVDIMM driver already does all of the above, why should we
reimplement it in Xen?

For the two problems raised in my previous reply, following are my
thoughts.

(1) (for the first problem) QEMU mmaps /dev/pmemXX into its virtual
    address space. When it works with KVM, it calls the KVM API to map
    that virtual address range into a guest physical address space.

    For Xen, I'm going to do a similar thing, but Xen does not seem to
    provide such an API. The closest one I can find is
    XEN_DOMCTL_memory_mapping (which is used by VGA passthrough in
    QEMU's xen_pt_graphics), but it does not accept a virtual address
    (QEMU only knows the virtual address at which it has mmapped
    /dev/pmemXX, not the machine address). Thus, I'm going to add a new
    one that does similar work but accepts a virtual address.

(2) (for the second problem) After having looked at the corresponding
    Linux kernel code and considering my comments at the beginning, I
    now doubt whether it's necessary to parse NFIT in Xen. Maybe I can
    follow what xen_pt_graphics does, that is, grant the guest
    permission to access the corresponding host NVDIMM address space
    range and then call the new hypercall added in (1).

    Again, a new hypercall that is similar to
    XEN_DOMCTL_iomem_permission but accepts a virtual address is
    needed; a possible shape of such an interface is sketched below.
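
    For illustration only, such an interface might look roughly like
    the following (a guess at a possible shape, not an existing libxc
    call; the name and parameters are made up):

        /* Hypothetical wrapper, not an existing call: ask Xen to map
         * nr_pages of a dom0 virtual address range (e.g. QEMU's mmap
         * of /dev/pmemXX) into a guest's physical address space
         * starting at first_gfn. */
        int xc_domain_map_vaddr_to_gfn(xc_interface *xch,
                                       uint32_t domid,
                                       void *vaddr,  /* dom0 virtual address */
                                       unsigned long nr_pages,
                                       unsigned long first_gfn);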

Any comments?

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20  5:14             ` Tian, Kevin
@ 2016-01-20  5:58               ` Zhang, Haozhong
  0 siblings, 0 replies; 88+ messages in thread
From: Zhang, Haozhong @ 2016-01-20  5:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Nakajima, Jun,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich, Keir Fraser

On 01/20/16 13:14, Tian, Kevin wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Tuesday, January 19, 2016 7:47 PM
> > 
> > >>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote:
> > > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> > >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> > >> > On 01/15/16 10:10, Jan Beulich wrote:
> > >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> > >> >> > NVDIMM devices are detected and configured by software through
> > >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> > >> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> > >> >> > ACPI tables to load extra ACPI tables built by QEMU.
> > >> >>
> > >> >> Mechanically the patch looks okay, but whether it's actually needed
> > >> >> depends on whether indeed we want NV RAM managed in qemu
> > >> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> > >> >> reply yet to that same comment of mine made (iirc) in the context
> > >> >> of another patch.
> > >> >
> > >> > One purpose of this patch series is to provide vNVDIMM backed by host
> > >> > NVDIMM devices. It requires some drivers to detect and manage host
> > >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > >> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > >> > then mmaps them into certain range of dom0's address space and asks
> > >> > Xen hypervisor to map that range of address space to a domU.
> > >> >
> > >
> > > OOI Do we have a viable solution to do all these non-trivial things in
> > > core hypervisor?  Are you proposing designing a new set of hypercalls
> > > for NVDIMM?
> > 
> > That's certainly a possibility; I lack sufficient detail to make myself
> > an opinion which route is going to be best.
> > 
> > Jan
> 
> Hi, Haozhong,
> 
> Are NVDIMM related ACPI table in plain text format, or do they require
> a ACPI parser to decode? Is there a corresponding E820 entry?
>

Most of them are in plain (static) format, but the driver still
evaluates the _FIT (Firmware Interface Table) method, and decoding is
needed for that.

> Above information would be useful to help decide the direction.
> 
> In a glimpse I like Jan's idea that it's better to let Xen manage NVDIMM
> since it's a type of memory resource while for memory we expect hypervisor
> to centrally manage.
> 
> However in another thought the answer is different if we view this 
> resource as a MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc.
> then it should be fine to have Dom0 manage NVDIMM then Xen just controls
> the mapping based on existing io permission mechanism.
>

It's more like an MMIO device than normal RAM.

> Another possible point for this model is that PMEM is only one mode of 
> NVDIMM device, which can be also exposed as a storage device. In the
> latter case the management has to be in Dom0. So we don't need to
> scatter the management role into Dom0/Xen based on different modes.
>

An NVDIMM device in PMEM mode is exposed as a storage device (a block
device, /dev/pmemXX) in Linux, and it's also used like a disk drive
(you can make a file system on it, create files on it, and even pass
individual files rather than the whole /dev/pmemXX to guests).

> Back to your earlier questions:
> 
> > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
> >     the host NVDIMM to domU, which results VMEXIT for every guest
> >     read/write to the corresponding vNVDIMM devices. I'm going to find
> >     a way to passthrough the address space range of host NVDIMM to a
> >     guest domU (similarly to what xen-pt in QEMU uses)
> > 
> > (2) Xen currently does not check whether the address that QEMU asks to
> >     map to domU is really within the host NVDIMM address
> >     space. Therefore, Xen hypervisor needs a way to decide the host
> >     NVDIMM address space which can be done by parsing ACPI NFIT
> >     tables.
> 
> If you look at how ACPI OpRegion is handled for IGD passthrough:
> 
>  241     ret = xc_domain_iomem_permission(xen_xc, xen_domid,
>  242             (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
>  243             XEN_PCI_INTEL_OPREGION_PAGES,
>  244             XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED);
> 
>  254     ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>  255             (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT),
>  256             (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
>  257             XEN_PCI_INTEL_OPREGION_PAGES,
>  258             DPCI_ADD_MAPPING);
>

Yes, I've noticed these two functions. The additional work would be
adding new ones that accept a virtual address, as QEMU has no easy way
to get the physical address of /dev/pmemXX and can only mmap it into
its virtual address space.

> Above can address your 2 questions. Xen doesn't need to tell exactly
> whether the assigned range actually belongs to NVDIMM, just like
> the policy for PCI assignment today.
>

Does that mean the Xen hypervisor can trust whatever address the dom0
kernel and QEMU provide?

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20  5:31         ` Haozhong Zhang
@ 2016-01-20  8:46           ` Jan Beulich
  2016-01-20  8:58             ` Andrew Cooper
  2016-01-20 11:04             ` Haozhong Zhang
  0 siblings, 2 replies; 88+ messages in thread
From: Jan Beulich @ 2016-01-20  8:46 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> The primary reason of current solution is to reuse existing NVDIMM
> driver in Linux kernel.

Re-using code in the Dom0 kernel has benefits and drawbacks, and
in any event needs to depend on proper layering to remain in place.
A benefit is less code duplication between Xen and Linux; along the
same lines a drawback is code duplication between various Dom0
OS variants.

> One responsibility of this driver is to discover NVDIMM devices and
> their parameters (e.g. which portion of an NVDIMM device can be mapped
> into the system address space and which address it is mapped to) by
> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> ACPI Specification v6 and the actual code in Linux kernel
> (drivers/acpi/nfit.*), it's not a trivial task.

To answer one of Kevin's questions: the NFIT table doesn't appear
to require the ACPI interpreter; it seems more like SRAT and SLIT.
Also, you failed to answer Kevin's question regarding E820 entries: I
think NVDIMMs (or at least parts thereof) get represented in E820 (or
the EFI memory map), and if that's the case this would be a very
strong hint towards management needing to be in the hypervisor.

> Secondly, the driver implements a convenient block device interface to
> let software access areas where NVDIMM devices are mapped. The
> existing vNVDIMM implementation in QEMU uses this interface.
> 
> As Linux NVDIMM driver has already done above, why do we bother to
> reimplement them in Xen?

See above; a possibility is that we may need a split model (block
layer parts in Dom0, "normal memory" parts in the hypervisor).
Iirc the split is determined by firmware, and hence set in
stone by the time the OS (or hypervisor) boot starts.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20  8:46           ` Jan Beulich
@ 2016-01-20  8:58             ` Andrew Cooper
  2016-01-20 10:15               ` Haozhong Zhang
  2016-01-20 11:04             ` Haozhong Zhang
  1 sibling, 1 reply; 88+ messages in thread
From: Andrew Cooper @ 2016-01-20  8:58 UTC (permalink / raw)
  To: Jan Beulich, Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser

On 20/01/2016 08:46, Jan Beulich wrote:
>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>> The primary reason of current solution is to reuse existing NVDIMM
>> driver in Linux kernel.
> Re-using code in the Dom0 kernel has benefits and drawbacks, and
> in any event needs to depend on proper layering to remain in place.
> A benefit is less code duplication between Xen and Linux; along the
> same lines a drawback is code duplication between various Dom0
> OS variants.
>
>> One responsibility of this driver is to discover NVDIMM devices and
>> their parameters (e.g. which portion of an NVDIMM device can be mapped
>> into the system address space and which address it is mapped to) by
>> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
>> ACPI Specification v6 and the actual code in Linux kernel
>> (drivers/acpi/nfit.*), it's not a trivial task.
> To answer one of Kevin's questions: The NFIT table doesn't appear
> to require the ACPI interpreter. They seem more like SRAT and SLIT.
> Also you failed to answer Kevin's question regarding E820 entries: I
> think NVDIMM (or at least parts thereof) get represented in E820 (or
> the EFI memory map), and if that's the case this would be a very
> strong hint towards management needing to be in the hypervisor.

Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped
into memory.  I am still on the dom0 side of this fence.

The real question is whether it is possible to take an NVDIMM, split it
in half, give each half to two different guests (with appropriate NFIT
tables) and that be sufficient for the guests to just work.

Either way, it needs to be a toolstack policy decision as to how to
split the resource.

~Andrew

>
>> Secondly, the driver implements a convenient block device interface to
>> let software access areas where NVDIMM devices are mapped. The
>> existing vNVDIMM implementation in QEMU uses this interface.
>>
>> As Linux NVDIMM driver has already done above, why do we bother to
>> reimplement them in Xen?
> See above; a possibility is that we may need a split model (block
> layer parts on Dom0, "normal memory" parts in the hypervisor.
> Iirc the split is being determined by firmware, and hence set in
> stone by the time OS (or hypervisor) boot starts.
>
> Jan
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20  8:58             ` Andrew Cooper
@ 2016-01-20 10:15               ` Haozhong Zhang
  2016-01-20 10:36                 ` Xiao Guangrong
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 10:15 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Xiao Guangrong, Ian Jackson, xen-devel, Jan Beulich,
	Jun Nakajima, Keir Fraser

On 01/20/16 08:58, Andrew Cooper wrote:
> On 20/01/2016 08:46, Jan Beulich wrote:
> >>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> >> The primary reason of current solution is to reuse existing NVDIMM
> >> driver in Linux kernel.
> > Re-using code in the Dom0 kernel has benefits and drawbacks, and
> > in any event needs to depend on proper layering to remain in place.
> > A benefit is less code duplication between Xen and Linux; along the
> > same lines a drawback is code duplication between various Dom0
> > OS variants.
> >
> >> One responsibility of this driver is to discover NVDIMM devices and
> >> their parameters (e.g. which portion of an NVDIMM device can be mapped
> >> into the system address space and which address it is mapped to) by
> >> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> >> ACPI Specification v6 and the actual code in Linux kernel
> >> (drivers/acpi/nfit.*), it's not a trivial task.
> > To answer one of Kevin's questions: The NFIT table doesn't appear
> > to require the ACPI interpreter. They seem more like SRAT and SLIT.
> > Also you failed to answer Kevin's question regarding E820 entries: I
> > think NVDIMM (or at least parts thereof) get represented in E820 (or
> > the EFI memory map), and if that's the case this would be a very
> > strong hint towards management needing to be in the hypervisor.
>

CCing QEMU vNVDIMM maintainer: Xiao Guangrong

> Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped
> into memory.  I am still on the dom0 side of this fence.
> 
> The real question is whether it is possible to take an NVDIMM, split it
> in half, give each half to two different guests (with appropriate NFIT
> tables) and that be sufficient for the guests to just work.
>

Yes, one NVDIMM device can be split into multiple parts and assigned
to different guests, and QEMU is responsible for maintaining virtual
NFIT tables for each part.

> Either way, it needs to be a toolstack policy decision as to how to
> split the resource.
>

But the split does not need to be done on the Xen side, IMO. It can be
done by the dom0 kernel and QEMU, as long as they tell the Xen
hypervisor the address space range of each part.

Haozhong

> ~Andrew
> 
> >
> >> Secondly, the driver implements a convenient block device interface to
> >> let software access areas where NVDIMM devices are mapped. The
> >> existing vNVDIMM implementation in QEMU uses this interface.
> >>
> >> As Linux NVDIMM driver has already done above, why do we bother to
> >> reimplement them in Xen?
> > See above; a possibility is that we may need a split model (block
> > layer parts on Dom0, "normal memory" parts in the hypervisor.
> > Iirc the split is being determined by firmware, and hence set in
> > stone by the time OS (or hypervisor) boot starts.
> >
> > Jan
> >
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 10:15               ` Haozhong Zhang
@ 2016-01-20 10:36                 ` Xiao Guangrong
  2016-01-20 13:16                   ` Andrew Cooper
  0 siblings, 1 reply; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-20 10:36 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Ian Campbell, Wei Liu, Ian Jackson,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, xen-devel,
	Keir Fraser


Hi,

On 01/20/2016 06:15 PM, Haozhong Zhang wrote:

> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
>
>> Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped
>> into memory.  I am still on the dom0 side of this fence.
>>
>> The real question is whether it is possible to take an NVDIMM, split it
>> in half, give each half to two different guests (with appropriate NFIT
>> tables) and that be sufficient for the guests to just work.
>>
>
> Yes, one NVDIMM device can be split into multiple parts and assigned
> to different guests, and QEMU is responsible to maintain virtual NFIT
> tables for each part.
>
>> Either way, it needs to be a toolstack policy decision as to how to
>> split the resource.

Currently, we are using the NVDIMM as a block device, and a DAX-based
filesystem is created on it in Linux so that file-related accesses
directly reach the NVDIMM device.

In KVM, if the NVDIMM device needs to be shared by different VMs, we can
create multiple files on the DAX-based filesystem and assign a file to
each VM. In the future, we can enable namespaces (partition-like) for
PMEM and assign a namespace to each VM (the current Linux driver uses
the whole PMEM as a single namespace).

I think it is not easy to let the Xen hypervisor recognize NVDIMM
devices and manage the NVDIMM resource.

Thanks!

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20  8:46           ` Jan Beulich
  2016-01-20  8:58             ` Andrew Cooper
@ 2016-01-20 11:04             ` Haozhong Zhang
  2016-01-20 11:20               ` Jan Beulich
  2016-01-20 15:07               ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 11:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Xiao Guangrong, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser

On 01/20/16 01:46, Jan Beulich wrote:
> >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> > The primary reason of current solution is to reuse existing NVDIMM
> > driver in Linux kernel.
>

CC'ing QEMU vNVDIMM maintainer: Xiao Guangrong

> Re-using code in the Dom0 kernel has benefits and drawbacks, and
> in any event needs to depend on proper layering to remain in place.
> A benefit is less code duplication between Xen and Linux; along the
> same lines a drawback is code duplication between various Dom0
> OS variants.
>

I'm not sure about other Dom0 OSes, but Linux has had an NVDIMM driver
since 4.2.

> > One responsibility of this driver is to discover NVDIMM devices and
> > their parameters (e.g. which portion of an NVDIMM device can be mapped
> > into the system address space and which address it is mapped to) by
> > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> > ACPI Specification v6 and the actual code in Linux kernel
> > (drivers/acpi/nfit.*), it's not a trivial task.
> 
> To answer one of Kevin's questions: The NFIT table doesn't appear
> to require the ACPI interpreter. They seem more like SRAT and SLIT.

Sorry, I made a mistake in another reply. NFIT does not contain
anything requiring an ACPI interpreter. But there are some _DSM methods
for NVDIMM in the SSDT, which do need an ACPI interpreter.

> Also you failed to answer Kevin's question regarding E820 entries: I
> think NVDIMM (or at least parts thereof) get represented in E820 (or
> the EFI memory map), and if that's the case this would be a very
> strong hint towards management needing to be in the hypervisor.
>

Legacy NVDIMM devices may use E820 entries or other ad-hoc ways to
announce their locations, but newer ones that follow the ACPI v6 spec
do not need E820 any more and only need the ACPI NFIT (i.e. firmware
may not build E820 entries for them).

The current Linux kernel can handle both legacy and new NVDIMM devices
and provides the same block device interface for both.

> > Secondly, the driver implements a convenient block device interface to
> > let software access areas where NVDIMM devices are mapped. The
> > existing vNVDIMM implementation in QEMU uses this interface.
> > 
> > As Linux NVDIMM driver has already done above, why do we bother to
> > reimplement them in Xen?
> 
> See above; a possibility is that we may need a split model (block
> layer parts on Dom0, "normal memory" parts in the hypervisor.
> Iirc the split is being determined by firmware, and hence set in
> stone by the time OS (or hypervisor) boot starts.
>

For the "normal memory" parts, do you mean parts that map the host
NVDIMM device's address space range to the guest? I'm going to
implement that part in hypervisor and expose it as a hypercall so that
it can be used by QEMU.

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 11:04             ` Haozhong Zhang
@ 2016-01-20 11:20               ` Jan Beulich
  2016-01-20 15:29                 ` Xiao Guangrong
  2016-01-20 15:07               ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-20 11:20 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
> On 01/20/16 01:46, Jan Beulich wrote:
>> >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>> > Secondly, the driver implements a convenient block device interface to
>> > let software access areas where NVDIMM devices are mapped. The
>> > existing vNVDIMM implementation in QEMU uses this interface.
>> > 
>> > As Linux NVDIMM driver has already done above, why do we bother to
>> > reimplement them in Xen?
>> 
>> See above; a possibility is that we may need a split model (block
>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>> Iirc the split is being determined by firmware, and hence set in
>> stone by the time OS (or hypervisor) boot starts.
> 
> For the "normal memory" parts, do you mean parts that map the host
> NVDIMM device's address space range to the guest? I'm going to
> implement that part in hypervisor and expose it as a hypercall so that
> it can be used by QEMU.

To answer this I need to have my understanding of the partitioning
being done by firmware confirmed: If that's the case, then "normal"
means the part that doesn't get exposed as a block device (SSD).
In any event there's no correlation to guest exposure here.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-20  3:28 ` Tian, Kevin
@ 2016-01-20 12:43   ` Stefano Stabellini
  2016-01-20 14:26     ` Zhang, Haozhong
  0 siblings, 1 reply; 88+ messages in thread
From: Stefano Stabellini @ 2016-01-20 12:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhang, Haozhong, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Nakajima, Jun, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Wei Liu

On Wed, 20 Jan 2016, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, December 29, 2015 7:32 PM
> > 
> > This patch series is the Xen part patch to provide virtual NVDIMM to
> > guest. The corresponding QEMU patch series is sent separately with the
> > title "[PATCH 0/2] add vNVDIMM support for Xen".
> > 
> > * Background
> > 
> >  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> >  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> >  and configured by _DSM method of NVDIMM device in ACPI. Some
> >  documents can be found at
> >  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> >  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> >  [3] DSM Interface Example:
> > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> >  [4] Driver Writer's Guide:
> > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > 
> >  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> >  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> >  mapped into CPU's address space and are accessed via normal memory
> >  read/write and three special instructions (clflushopt/clwb/pcommit).
> > 
> >  This patch series and the corresponding QEMU patch series enable Xen
> >  to provide vNVDIMM devices to HVM domains.
> > 
> > * Design
> > 
> >  Supporting vNVDIMM in PMEM mode has three requirements.
> > 
> 
> Although this design is about vNVDIMM, some background of how pNVDIMM
> is managed in Xen would be helpful to understand the whole design since
> in PMEM mode you need map pNVDIMM into GFN addr space so there's
> a matter of how pNVDIMM is allocated.

Yes, some background would be very helpful. Given that there are so many
moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
I suggest that we start with a design document for this feature.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 10:36                 ` Xiao Guangrong
@ 2016-01-20 13:16                   ` Andrew Cooper
  2016-01-20 14:29                     ` Stefano Stabellini
  2016-01-20 14:38                     ` Haozhong Zhang
  0 siblings, 2 replies; 88+ messages in thread
From: Andrew Cooper @ 2016-01-20 13:16 UTC (permalink / raw)
  To: Xiao Guangrong, Jan Beulich, Ian Campbell, Wei Liu, Ian Jackson,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, xen-devel,
	Keir Fraser

On 20/01/16 10:36, Xiao Guangrong wrote:
>
> Hi,
>
> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
>
>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
>>
>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
>>> mapped
>>> into memory.  I am still on the dom0 side of this fence.
>>>
>>> The real question is whether it is possible to take an NVDIMM, split it
>>> in half, give each half to two different guests (with appropriate NFIT
>>> tables) and that be sufficient for the guests to just work.
>>>
>>
>> Yes, one NVDIMM device can be split into multiple parts and assigned
>> to different guests, and QEMU is responsible to maintain virtual NFIT
>> tables for each part.
>>
>>> Either way, it needs to be a toolstack policy decision as to how to
>>> split the resource.
>
> Currently, we are using NVDIMM as a block device and a DAX-based
> filesystem
> is created upon it in Linux so that file-related accesses directly reach
> the NVDIMM device.
>
> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> create multiple files on the DAX-based filesystem and assign the file to
> each VMs. In the future, we can enable namespace (partition-like) for
> PMEM
> memory and assign the namespace to each VMs (current Linux driver uses
> the
> whole PMEM as a single namespace).
>
> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> device
> and manager NVDIMM resource.
>
> Thanks!
>

The more I see about this, the more sure I am that we want to keep it as
a block device managed by dom0.

In the case of the DAX-based filesystem, I presume files are not
necessarily contiguous. I also presume that this is worked around by
permuting the mapping of the virtual NVDIMM such that it appears as
a contiguous block of addresses to the guest?

Today in Xen, Qemu already has the ability to create mappings in the
guest's address space, e.g. to map PCI device BARs.  I don't see a
conceptual difference here, although the security/permission model
certainly is more complicated.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-20 12:43   ` Stefano Stabellini
@ 2016-01-20 14:26     ` Zhang, Haozhong
  2016-01-20 14:35       ` Stefano Stabellini
  0 siblings, 1 reply; 88+ messages in thread
From: Zhang, Haozhong @ 2016-01-20 14:26 UTC (permalink / raw)
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Nakajima, Jun,
	Andrew Cooper, Ian Jackson, Xiao Guangrong, xen-devel,
	Jan Beulich, Wei Liu

On 01/20/16 12:43, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Tian, Kevin wrote:
> > > From: Zhang, Haozhong
> > > Sent: Tuesday, December 29, 2015 7:32 PM
> > > 
> > > This patch series is the Xen part patch to provide virtual NVDIMM to
> > > guest. The corresponding QEMU patch series is sent separately with the
> > > title "[PATCH 0/2] add vNVDIMM support for Xen".
> > > 
> > > * Background
> > > 
> > >  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> > >  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> > >  and configured by _DSM method of NVDIMM device in ACPI. Some
> > >  documents can be found at
> > >  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> > >  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> > >  [3] DSM Interface Example:
> > > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > >  [4] Driver Writer's Guide:
> > > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > > 
> > >  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> > >  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> > >  mapped into CPU's address space and are accessed via normal memory
> > >  read/write and three special instructions (clflushopt/clwb/pcommit).
> > > 
> > >  This patch series and the corresponding QEMU patch series enable Xen
> > >  to provide vNVDIMM devices to HVM domains.
> > > 
> > > * Design
> > > 
> > >  Supporting vNVDIMM in PMEM mode has three requirements.
> > > 
> > 
> > Although this design is about vNVDIMM, some background of how pNVDIMM
> > is managed in Xen would be helpful to understand the whole design since
> > in PMEM mode you need map pNVDIMM into GFN addr space so there's
> > a matter of how pNVDIMM is allocated.
> 
> Yes, some background would be very helpful. Given that there are so many
> moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
> I suggest that we start with a design document for this feature.

Let me prepare a design document. Basically, it would include
the following contents. Please let me know if you want anything additional
to be included.

* What NVDIMM is and how it is used
* Software interface of NVDIMM
  - ACPI NFIT: what parameters are recorded and their usage
  - ACPI SSDT: what _DSM methods are provided and their functionality
  - New instructions: clflushopt/clwb/pcommit
* How the linux kernel drives NVDIMM
  - ACPI parsing
  - Block device interface
  - Partition NVDIMM devices
* How KVM/QEMU implements vNVDIMM
* What I propose to implement vNVDIMM in Xen
  - Xen hypervisor/toolstack: new instruction enabling and address mapping
  - Dom0 Linux kernel: host NVDIMM driver
  - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 13:16                   ` Andrew Cooper
@ 2016-01-20 14:29                     ` Stefano Stabellini
  2016-01-20 14:42                       ` Haozhong Zhang
  2016-01-20 14:45                       ` Andrew Cooper
  2016-01-20 14:38                     ` Haozhong Zhang
  1 sibling, 2 replies; 88+ messages in thread
From: Stefano Stabellini @ 2016-01-20 14:29 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Ian Jackson, xen-devel, Jan Beulich,
	Xiao Guangrong, Keir Fraser

On Wed, 20 Jan 2016, Andrew Cooper wrote:
> On 20/01/16 10:36, Xiao Guangrong wrote:
> >
> > Hi,
> >
> > On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >
> >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>
> >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>> mapped
> >>> into memory.  I am still on the dom0 side of this fence.
> >>>
> >>> The real question is whether it is possible to take an NVDIMM, split it
> >>> in half, give each half to two different guests (with appropriate NFIT
> >>> tables) and that be sufficient for the guests to just work.
> >>>
> >>
> >> Yes, one NVDIMM device can be split into multiple parts and assigned
> >> to different guests, and QEMU is responsible to maintain virtual NFIT
> >> tables for each part.
> >>
> >>> Either way, it needs to be a toolstack policy decision as to how to
> >>> split the resource.
> >
> > Currently, we are using NVDIMM as a block device and a DAX-based
> > filesystem
> > is created upon it in Linux so that file-related accesses directly reach
> > the NVDIMM device.
> >
> > In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > create multiple files on the DAX-based filesystem and assign the file to
> > each VMs. In the future, we can enable namespace (partition-like) for
> > PMEM
> > memory and assign the namespace to each VMs (current Linux driver uses
> > the
> > whole PMEM as a single namespace).
> >
> > I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > device
> > and manager NVDIMM resource.
> >
> > Thanks!
> >
> 
> The more I see about this, the more sure I am that we want to keep it as
> a block device managed by dom0.
> 
> In the case of the DAX-based filesystem, I presume files are not
> necessarily contiguous.  I also presume that this is worked around by
> permuting the mapping of the virtual NVDIMM such that the it appears as
> a contiguous block of addresses to the guest?
> 
> Today in Xen, Qemu already has the ability to create mappings in the
> guest's address space, e.g. to map PCI device BARs.  I don't see a
> conceptual difference here, although the security/permission model
> certainly is more complicated.

I imagine that mmap'ing these /dev/pmemXX devices requires root
privileges, does it not?

I wouldn't encourage the introduction of anything else that requires
root privileges in QEMU. With QEMU running as non-root by default in
4.7, the feature will not be available unless users explicitly ask to
run QEMU as root (which they shouldn't really).

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-20 14:26     ` Zhang, Haozhong
@ 2016-01-20 14:35       ` Stefano Stabellini
  2016-01-20 14:47         ` Zhang, Haozhong
  0 siblings, 1 reply; 88+ messages in thread
From: Stefano Stabellini @ 2016-01-20 14:35 UTC (permalink / raw)
  To: Zhang, Haozhong
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Nakajima, Jun,
	Andrew Cooper, Ian Jackson, Xiao Guangrong, xen-devel,
	Jan Beulich, Wei Liu

On Wed, 20 Jan 2016, Zhang, Haozhong wrote:
> On 01/20/16 12:43, Stefano Stabellini wrote:
> > On Wed, 20 Jan 2016, Tian, Kevin wrote:
> > > > From: Zhang, Haozhong
> > > > Sent: Tuesday, December 29, 2015 7:32 PM
> > > > 
> > > > This patch series is the Xen part patch to provide virtual NVDIMM to
> > > > guest. The corresponding QEMU patch series is sent separately with the
> > > > title "[PATCH 0/2] add vNVDIMM support for Xen".
> > > > 
> > > > * Background
> > > > 
> > > >  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> > > >  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> > > >  and configured by _DSM method of NVDIMM device in ACPI. Some
> > > >  documents can be found at
> > > >  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> > > >  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> > > >  [3] DSM Interface Example:
> > > > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > > >  [4] Driver Writer's Guide:
> > > > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > > > 
> > > >  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> > > >  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> > > >  mapped into CPU's address space and are accessed via normal memory
> > > >  read/write and three special instructions (clflushopt/clwb/pcommit).
> > > > 
> > > >  This patch series and the corresponding QEMU patch series enable Xen
> > > >  to provide vNVDIMM devices to HVM domains.
> > > > 
> > > > * Design
> > > > 
> > > >  Supporting vNVDIMM in PMEM mode has three requirements.
> > > > 
> > > 
> > > Although this design is about vNVDIMM, some background of how pNVDIMM
> > > is managed in Xen would be helpful to understand the whole design since
> > > in PMEM mode you need map pNVDIMM into GFN addr space so there's
> > > a matter of how pNVDIMM is allocated.
> > 
> > Yes, some background would be very helpful. Given that there are so many
> > moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
> > I suggest that we start with a design document for this feature.
> 
> Let me prepare a design document. Basically, it would include
> following contents. Please let me know if you want anything additional
> to be included.

Thank you!


> * What NVDIMM is and how it is used
> * Software interface of NVDIMM
>   - ACPI NFIT: what parameters are recorded and their usage
>   - ACPI SSDT: what _DSM methods are provided and their functionality
>   - New instructions: clflushopt/clwb/pcommit
> * How the linux kernel drives NVDIMM
>   - ACPI parsing
>   - Block device interface
>   - Partition NVDIMM devices
> * How KVM/QEMU implements vNVDIMM

This is a very good start.


> * What I propose to implement vNVDIMM in Xen
>   - Xen hypervisor/toolstack: new instruction enabling and address mapping
>   - Dom0 Linux kernel: host NVDIMM driver
>   - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping

This is OK. It might also be good to list other options that were
discussed, but that is certainly not necessary in the first instance.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 13:16                   ` Andrew Cooper
  2016-01-20 14:29                     ` Stefano Stabellini
@ 2016-01-20 14:38                     ` Haozhong Zhang
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 14:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Ian Jackson, xen-devel, Jan Beulich,
	Xiao Guangrong, Keir Fraser

On 01/20/16 13:16, Andrew Cooper wrote:
> On 20/01/16 10:36, Xiao Guangrong wrote:
> >
> > Hi,
> >
> > On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >
> >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>
> >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>> mapped
> >>> into memory.  I am still on the dom0 side of this fence.
> >>>
> >>> The real question is whether it is possible to take an NVDIMM, split it
> >>> in half, give each half to two different guests (with appropriate NFIT
> >>> tables) and that be sufficient for the guests to just work.
> >>>
> >>
> >> Yes, one NVDIMM device can be split into multiple parts and assigned
> >> to different guests, and QEMU is responsible to maintain virtual NFIT
> >> tables for each part.
> >>
> >>> Either way, it needs to be a toolstack policy decision as to how to
> >>> split the resource.
> >
> > Currently, we are using NVDIMM as a block device and a DAX-based
> > filesystem
> > is created upon it in Linux so that file-related accesses directly reach
> > the NVDIMM device.
> >
> > In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > create multiple files on the DAX-based filesystem and assign the file to
> > each VMs. In the future, we can enable namespace (partition-like) for
> > PMEM
> > memory and assign the namespace to each VMs (current Linux driver uses
> > the
> > whole PMEM as a single namespace).
> >
> > I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > device
> > and manager NVDIMM resource.
> >
> > Thanks!
> >
> 
> The more I see about this, the more sure I am that we want to keep it as
> a block device managed by dom0.
> 
> In the case of the DAX-based filesystem, I presume files are not
> necessarily contiguous.  I also presume that this is worked around by
> permuting the mapping of the virtual NVDIMM such that the it appears as
> a contiguous block of addresses to the guest?
>

No, the files do not need to be contiguous. We can map those
non-contiguous parts into a contiguous guest physical address space area,
and QEMU fills in the base address and size of that area in the vNFIT.
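
(As a rough illustration of that mapping step - not a proposal of the
final interface.  This just assumes libxc's existing
xc_domain_memory_mapping(), normally used for MMIO/BAR mappings, could be
pressed into service, which is exactly the open question here:)

#include <xenctrl.h>

/* Sketch only: place several scattered host extents (mfns[i], nr[i]
 * frames each) back-to-back at a contiguous guest frame range starting
 * at gfn_base.  QEMU would then report that contiguous range as the SPA
 * base/size in the virtual NFIT. */
static int map_vnvdimm_extents(xc_interface *xch, uint32_t domid,
                               unsigned long gfn_base,
                               const unsigned long *mfns,
                               const unsigned long *nr,
                               unsigned int nr_extents)
{
    unsigned long gfn = gfn_base;
    unsigned int i;

    for ( i = 0; i < nr_extents; i++ )
    {
        int rc = xc_domain_memory_mapping(xch, domid, gfn, mfns[i],
                                          nr[i], 1 /* add mapping */);
        if ( rc )
            return rc;
        gfn += nr[i];
    }
    return 0;
}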

> Today in Xen, Qemu already has the ability to create mappings in the
> guest's address space, e.g. to map PCI device BARs.  I don't see a
> conceptual difference here, although the security/permission model
> certainly is more complicated.
>

I'm preparing a design document; let's see afterwards what the better
solution would be.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 14:29                     ` Stefano Stabellini
@ 2016-01-20 14:42                       ` Haozhong Zhang
  2016-01-20 14:45                       ` Andrew Cooper
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 14:42 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 01/20/16 14:29, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Andrew Cooper wrote:
> > On 20/01/16 10:36, Xiao Guangrong wrote:
> > >
> > > Hi,
> > >
> > > On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> > >
> > >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> > >>
> > >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> > >>> mapped
> > >>> into memory.  I am still on the dom0 side of this fence.
> > >>>
> > >>> The real question is whether it is possible to take an NVDIMM, split it
> > >>> in half, give each half to two different guests (with appropriate NFIT
> > >>> tables) and that be sufficient for the guests to just work.
> > >>>
> > >>
> > >> Yes, one NVDIMM device can be split into multiple parts and assigned
> > >> to different guests, and QEMU is responsible to maintain virtual NFIT
> > >> tables for each part.
> > >>
> > >>> Either way, it needs to be a toolstack policy decision as to how to
> > >>> split the resource.
> > >
> > > Currently, we are using NVDIMM as a block device and a DAX-based
> > > filesystem
> > > is created upon it in Linux so that file-related accesses directly reach
> > > the NVDIMM device.
> > >
> > > In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > > create multiple files on the DAX-based filesystem and assign the file to
> > > each VMs. In the future, we can enable namespace (partition-like) for
> > > PMEM
> > > memory and assign the namespace to each VMs (current Linux driver uses
> > > the
> > > whole PMEM as a single namespace).
> > >
> > > I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > > device
> > > and manager NVDIMM resource.
> > >
> > > Thanks!
> > >
> > 
> > The more I see about this, the more sure I am that we want to keep it as
> > a block device managed by dom0.
> > 
> > In the case of the DAX-based filesystem, I presume files are not
> > necessarily contiguous.  I also presume that this is worked around by
> > permuting the mapping of the virtual NVDIMM such that the it appears as
> > a contiguous block of addresses to the guest?
> > 
> > Today in Xen, Qemu already has the ability to create mappings in the
> > guest's address space, e.g. to map PCI device BARs.  I don't see a
> > conceptual difference here, although the security/permission model
> > certainly is more complicated.
> 
> I imagine that mmap'ing  these /dev/pmemXX devices require root
> privileges, does it not?
>

Yes, unless we assign non-root access permissions to /dev/pmemXX (but
this is not the default behavior of the Linux kernel so far).

> I wouldn't encourage the introduction of anything else that requires
> root privileges in QEMU. With QEMU running as non-root by default in
> 4.7, the feature will not be available unless users explicitly ask to
> run QEMU as root (which they shouldn't really).
>

Yes, I'll include those privileged operations in the design document.

Haozhong

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 14:29                     ` Stefano Stabellini
  2016-01-20 14:42                       ` Haozhong Zhang
@ 2016-01-20 14:45                       ` Andrew Cooper
  2016-01-20 14:53                         ` Haozhong Zhang
  2016-01-20 15:05                         ` Stefano Stabellini
  1 sibling, 2 replies; 88+ messages in thread
From: Andrew Cooper @ 2016-01-20 14:45 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Jun Nakajima, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On 20/01/16 14:29, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Andrew Cooper wrote:
>> On 20/01/16 10:36, Xiao Guangrong wrote:
>>> Hi,
>>>
>>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
>>>
>>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
>>>>
>>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
>>>>> mapped
>>>>> into memory.  I am still on the dom0 side of this fence.
>>>>>
>>>>> The real question is whether it is possible to take an NVDIMM, split it
>>>>> in half, give each half to two different guests (with appropriate NFIT
>>>>> tables) and that be sufficient for the guests to just work.
>>>>>
>>>> Yes, one NVDIMM device can be split into multiple parts and assigned
>>>> to different guests, and QEMU is responsible to maintain virtual NFIT
>>>> tables for each part.
>>>>
>>>>> Either way, it needs to be a toolstack policy decision as to how to
>>>>> split the resource.
>>> Currently, we are using NVDIMM as a block device and a DAX-based
>>> filesystem
>>> is created upon it in Linux so that file-related accesses directly reach
>>> the NVDIMM device.
>>>
>>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
>>> create multiple files on the DAX-based filesystem and assign the file to
>>> each VMs. In the future, we can enable namespace (partition-like) for
>>> PMEM
>>> memory and assign the namespace to each VMs (current Linux driver uses
>>> the
>>> whole PMEM as a single namespace).
>>>
>>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
>>> device
>>> and manager NVDIMM resource.
>>>
>>> Thanks!
>>>
>> The more I see about this, the more sure I am that we want to keep it as
>> a block device managed by dom0.
>>
>> In the case of the DAX-based filesystem, I presume files are not
>> necessarily contiguous.  I also presume that this is worked around by
>> permuting the mapping of the virtual NVDIMM such that the it appears as
>> a contiguous block of addresses to the guest?
>>
>> Today in Xen, Qemu already has the ability to create mappings in the
>> guest's address space, e.g. to map PCI device BARs.  I don't see a
>> conceptual difference here, although the security/permission model
>> certainly is more complicated.
> I imagine that mmap'ing  these /dev/pmemXX devices require root
> privileges, does it not?

I presume it does, although mmap()ing a file on a DAX filesystem will
work in the standard POSIX way.

Neither of these is sufficient, however.  That gets Qemu a mapping of
the NVDIMM, not the guest.  Something, one way or another, has to turn
this into appropriate add-to-physmap hypercalls.

>
> I wouldn't encourage the introduction of anything else that requires
> root privileges in QEMU. With QEMU running as non-root by default in
> 4.7, the feature will not be available unless users explicitly ask to
> run QEMU as root (which they shouldn't really).

This isn't how design works.

First, design a feature in an architecturally correct way, and then
design a security policy to fit.  (Note: both before implementation happens.)

We should not stunt a design based on an existing implementation.  In
particular, if the design shows that being a root-only feature is the only
sane way of doing this, it should be a root-only feature.  (I hope this
is not the case, but it shouldn't cloud the judgement of a design.)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-20 14:35       ` Stefano Stabellini
@ 2016-01-20 14:47         ` Zhang, Haozhong
  2016-01-20 14:54           ` Andrew Cooper
  0 siblings, 1 reply; 88+ messages in thread
From: Zhang, Haozhong @ 2016-01-20 14:47 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Nakajima, Jun,
	Andrew Cooper, Ian Jackson, Xiao Guangrong, xen-devel,
	Jan Beulich, Wei Liu

On 01/20/16 14:35, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Zhang, Haozhong wrote:
> > On 01/20/16 12:43, Stefano Stabellini wrote:
> > > On Wed, 20 Jan 2016, Tian, Kevin wrote:
> > > > > From: Zhang, Haozhong
> > > > > Sent: Tuesday, December 29, 2015 7:32 PM
> > > > > 
> > > > > This patch series is the Xen part patch to provide virtual NVDIMM to
> > > > > guest. The corresponding QEMU patch series is sent separately with the
> > > > > title "[PATCH 0/2] add vNVDIMM support for Xen".
> > > > > 
> > > > > * Background
> > > > > 
> > > > >  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> > > > >  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> > > > >  and configured by _DSM method of NVDIMM device in ACPI. Some
> > > > >  documents can be found at
> > > > >  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> > > > >  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> > > > >  [3] DSM Interface Example:
> > > > > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > > > >  [4] Driver Writer's Guide:
> > > > > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > > > > 
> > > > >  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> > > > >  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> > > > >  mapped into CPU's address space and are accessed via normal memory
> > > > >  read/write and three special instructions (clflushopt/clwb/pcommit).
> > > > > 
> > > > >  This patch series and the corresponding QEMU patch series enable Xen
> > > > >  to provide vNVDIMM devices to HVM domains.
> > > > > 
> > > > > * Design
> > > > > 
> > > > >  Supporting vNVDIMM in PMEM mode has three requirements.
> > > > > 
> > > > 
> > > > Although this design is about vNVDIMM, some background of how pNVDIMM
> > > > is managed in Xen would be helpful to understand the whole design since
> > > > in PMEM mode you need map pNVDIMM into GFN addr space so there's
> > > > a matter of how pNVDIMM is allocated.
> > > 
> > > Yes, some background would be very helpful. Given that there are so many
> > > moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
> > > I suggest that we start with a design document for this feature.
> > 
> > Let me prepare a design document. Basically, it would include
> > following contents. Please let me know if you want anything additional
> > to be included.
> 
> Thank you!
> 
> 
> > * What NVDIMM is and how it is used
> > * Software interface of NVDIMM
> >   - ACPI NFIT: what parameters are recorded and their usage
> >   - ACPI SSDT: what _DSM methods are provided and their functionality
> >   - New instructions: clflushopt/clwb/pcommit
> > * How the linux kernel drives NVDIMM
> >   - ACPI parsing
> >   - Block device interface
> >   - Partition NVDIMM devices
> > * How KVM/QEMU implements vNVDIMM
> 
> This is a very good start.
> 
> 
> > * What I propose to implement vNVDIMM in Xen
> >   - Xen hypervisor/toolstack: new instruction enabling and address mapping
> >   - Dom0 Linux kernel: host NVDIMM driver
> >   - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping
> 
> This is OK. It might be also good to list other options that were
> discussed, but it is certainly not necessary in first instance.

I'll include them.

And one thing I missed above:
* What I propose to implement vNVDIMM in Xen
  - Building vNFIT and vSSDT: copy them from QEMU to the Xen toolstack

I know this is controversial, so I will record the other options and my
reason for this choice.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 14:45                       ` Andrew Cooper
@ 2016-01-20 14:53                         ` Haozhong Zhang
  2016-01-20 15:13                           ` Konrad Rzeszutek Wilk
  2016-01-20 15:05                         ` Stefano Stabellini
  1 sibling, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 14:53 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Ian Jackson, xen-devel, Jan Beulich, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

On 01/20/16 14:45, Andrew Cooper wrote:
> On 20/01/16 14:29, Stefano Stabellini wrote:
> > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> >> On 20/01/16 10:36, Xiao Guangrong wrote:
> >>> Hi,
> >>>
> >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >>>
> >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>>>
> >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>>>> mapped
> >>>>> into memory.  I am still on the dom0 side of this fence.
> >>>>>
> >>>>> The real question is whether it is possible to take an NVDIMM, split it
> >>>>> in half, give each half to two different guests (with appropriate NFIT
> >>>>> tables) and that be sufficient for the guests to just work.
> >>>>>
> >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> >>>> tables for each part.
> >>>>
> >>>>> Either way, it needs to be a toolstack policy decision as to how to
> >>>>> split the resource.
> >>> Currently, we are using NVDIMM as a block device and a DAX-based
> >>> filesystem
> >>> is created upon it in Linux so that file-related accesses directly reach
> >>> the NVDIMM device.
> >>>
> >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> >>> create multiple files on the DAX-based filesystem and assign the file to
> >>> each VMs. In the future, we can enable namespace (partition-like) for
> >>> PMEM
> >>> memory and assign the namespace to each VMs (current Linux driver uses
> >>> the
> >>> whole PMEM as a single namespace).
> >>>
> >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> >>> device
> >>> and manager NVDIMM resource.
> >>>
> >>> Thanks!
> >>>
> >> The more I see about this, the more sure I am that we want to keep it as
> >> a block device managed by dom0.
> >>
> >> In the case of the DAX-based filesystem, I presume files are not
> >> necessarily contiguous.  I also presume that this is worked around by
> >> permuting the mapping of the virtual NVDIMM such that the it appears as
> >> a contiguous block of addresses to the guest?
> >>
> >> Today in Xen, Qemu already has the ability to create mappings in the
> >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> >> conceptual difference here, although the security/permission model
> >> certainly is more complicated.
> > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > privileges, does it not?
> 
> I presume it does, although mmap()ing a file on a DAX filesystem will
> work in the standard POSIX way.
> 
> Neither of these are sufficient however.  That gets Qemu a mapping of
> the NVDIMM, not the guest.  Something, one way or another, has to turn
> this into appropriate add-to-phymap hypercalls.
>

Yes, those hypercalls are what I'm going to add.

Haozhong

> >
> > I wouldn't encourage the introduction of anything else that requires
> > root privileges in QEMU. With QEMU running as non-root by default in
> > 4.7, the feature will not be available unless users explicitly ask to
> > run QEMU as root (which they shouldn't really).
> 
> This isn't how design works.
> 
> First, design a feature in an architecturally correct way, and then
> design an security policy to fit.  (note, both before implement happens).
> 
> We should not stunt design based on an existing implementation.  In
> particular, if design shows that being a root only feature is the only
> sane way of doing this, it should be a root only feature.  (I hope this
> is not the case, but it shouldn't cloud the judgement of a design).
> 
> ~Andrew
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-20 14:47         ` Zhang, Haozhong
@ 2016-01-20 14:54           ` Andrew Cooper
  2016-01-20 15:59             ` Haozhong Zhang
  0 siblings, 1 reply; 88+ messages in thread
From: Andrew Cooper @ 2016-01-20 14:54 UTC (permalink / raw)
  To: Stefano Stabellini, Tian, Kevin, xen-devel, Keir Fraser,
	Ian Jackson, Ian Campbell, Jan Beulich, Wei Liu, Nakajima, Jun,
	Xiao Guangrong

On 20/01/16 14:47, Zhang, Haozhong wrote:
> On 01/20/16 14:35, Stefano Stabellini wrote:
>> On Wed, 20 Jan 2016, Zhang, Haozhong wrote:
>>> On 01/20/16 12:43, Stefano Stabellini wrote:
>>>> On Wed, 20 Jan 2016, Tian, Kevin wrote:
>>>>>> From: Zhang, Haozhong
>>>>>> Sent: Tuesday, December 29, 2015 7:32 PM
>>>>>>
>>>>>> This patch series is the Xen part patch to provide virtual NVDIMM to
>>>>>> guest. The corresponding QEMU patch series is sent separately with the
>>>>>> title "[PATCH 0/2] add vNVDIMM support for Xen".
>>>>>>
>>>>>> * Background
>>>>>>
>>>>>>  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
>>>>>>  supported on Intel's platform. NVDIMM devices are discovered via ACPI
>>>>>>  and configured by _DSM method of NVDIMM device in ACPI. Some
>>>>>>  documents can be found at
>>>>>>  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
>>>>>>  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
>>>>>>  [3] DSM Interface Example:
>>>>>> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
>>>>>>  [4] Driver Writer's Guide:
>>>>>> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>>>>>>
>>>>>>  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
>>>>>>  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
>>>>>>  mapped into CPU's address space and are accessed via normal memory
>>>>>>  read/write and three special instructions (clflushopt/clwb/pcommit).
>>>>>>
>>>>>>  This patch series and the corresponding QEMU patch series enable Xen
>>>>>>  to provide vNVDIMM devices to HVM domains.
>>>>>>
>>>>>> * Design
>>>>>>
>>>>>>  Supporting vNVDIMM in PMEM mode has three requirements.
>>>>>>
>>>>> Although this design is about vNVDIMM, some background of how pNVDIMM
>>>>> is managed in Xen would be helpful to understand the whole design since
>>>>> in PMEM mode you need map pNVDIMM into GFN addr space so there's
>>>>> a matter of how pNVDIMM is allocated.
>>>> Yes, some background would be very helpful. Given that there are so many
>>>> moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
>>>> I suggest that we start with a design document for this feature.
>>> Let me prepare a design document. Basically, it would include
>>> following contents. Please let me know if you want anything additional
>>> to be included.
>> Thank you!
>>
>>
>>> * What NVDIMM is and how it is used
>>> * Software interface of NVDIMM
>>>   - ACPI NFIT: what parameters are recorded and their usage
>>>   - ACPI SSDT: what _DSM methods are provided and their functionality
>>>   - New instructions: clflushopt/clwb/pcommit
>>> * How the linux kernel drives NVDIMM
>>>   - ACPI parsing
>>>   - Block device interface
>>>   - Partition NVDIMM devices
>>> * How KVM/QEMU implements vNVDIMM
>> This is a very good start.
>>
>>
>>> * What I propose to implement vNVDIMM in Xen
>>>   - Xen hypervisor/toolstack: new instruction enabling and address mapping
>>>   - Dom0 Linux kernel: host NVDIMM driver
>>>   - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping
>> This is OK. It might be also good to list other options that were
>> discussed, but it is certainly not necessary in first instance.
> I'll include them.
>
> And one thing missed above:
> * What I propose to implement vNVDIMM in Xen
>   - Building vNFIT and vSSDT: copy them from QEMU to Xen toolstack
>
> I know it is controversial and will record other options and my reason
> for this choice.

Please would you split the subjects of "how to architect guest NVDIMM
support in Xen" from "how to get suitable ACPI tables into a guest". 
The former depends on the latter, but they are two different problems to
solve and shouldn't be conflated in one issue.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 14:45                       ` Andrew Cooper
  2016-01-20 14:53                         ` Haozhong Zhang
@ 2016-01-20 15:05                         ` Stefano Stabellini
  2016-01-20 18:14                           ` Andrew Cooper
  1 sibling, 1 reply; 88+ messages in thread
From: Stefano Stabellini @ 2016-01-20 15:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Ian Jackson, xen-devel, Jan Beulich,
	Xiao Guangrong, Keir Fraser

On Wed, 20 Jan 2016, Andrew Cooper wrote:
> On 20/01/16 14:29, Stefano Stabellini wrote:
> > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> >> On 20/01/16 10:36, Xiao Guangrong wrote:
> >>> Hi,
> >>>
> >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >>>
> >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>>>
> >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>>>> mapped
> >>>>> into memory.  I am still on the dom0 side of this fence.
> >>>>>
> >>>>> The real question is whether it is possible to take an NVDIMM, split it
> >>>>> in half, give each half to two different guests (with appropriate NFIT
> >>>>> tables) and that be sufficient for the guests to just work.
> >>>>>
> >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> >>>> tables for each part.
> >>>>
> >>>>> Either way, it needs to be a toolstack policy decision as to how to
> >>>>> split the resource.
> >>> Currently, we are using NVDIMM as a block device and a DAX-based
> >>> filesystem
> >>> is created upon it in Linux so that file-related accesses directly reach
> >>> the NVDIMM device.
> >>>
> >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> >>> create multiple files on the DAX-based filesystem and assign the file to
> >>> each VMs. In the future, we can enable namespace (partition-like) for
> >>> PMEM
> >>> memory and assign the namespace to each VMs (current Linux driver uses
> >>> the
> >>> whole PMEM as a single namespace).
> >>>
> >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> >>> device
> >>> and manager NVDIMM resource.
> >>>
> >>> Thanks!
> >>>
> >> The more I see about this, the more sure I am that we want to keep it as
> >> a block device managed by dom0.
> >>
> >> In the case of the DAX-based filesystem, I presume files are not
> >> necessarily contiguous.  I also presume that this is worked around by
> >> permuting the mapping of the virtual NVDIMM such that the it appears as
> >> a contiguous block of addresses to the guest?
> >>
> >> Today in Xen, Qemu already has the ability to create mappings in the
> >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> >> conceptual difference here, although the security/permission model
> >> certainly is more complicated.
> > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > privileges, does it not?
> 
> I presume it does, although mmap()ing a file on a DAX filesystem will
> work in the standard POSIX way.
> 
> Neither of these are sufficient however.  That gets Qemu a mapping of
> the NVDIMM, not the guest.  Something, one way or another, has to turn
> this into appropriate add-to-phymap hypercalls.
> 
> >
> > I wouldn't encourage the introduction of anything else that requires
> > root privileges in QEMU. With QEMU running as non-root by default in
> > 4.7, the feature will not be available unless users explicitly ask to
> > run QEMU as root (which they shouldn't really).
> 
> This isn't how design works.
> 
> First, design a feature in an architecturally correct way, and then
> design an security policy to fit.
>
> We should not stunt design based on an existing implementation.  In
> particular, if design shows that being a root only feature is the only
> sane way of doing this, it should be a root only feature.  (I hope this
> is not the case, but it shouldn't cloud the judgement of a design).

I would argue that security is an integral part of the architecture and
should not be retrofitted into it.

Is it really a good design if the only sane way to implement it is
making it a root-only feature? I think not. Designing security policies
for pieces of software that don't have the infrastructure for them is
costly, and that cost should be accounted for as part of the overall cost
of the solution rather than added to it in a second stage.


> (Note: both before implementation happens.)

That is ideal, but realistically in many cases nobody is able to produce
a design before the implementation happens. There are plenty of articles
written about this, going back to the 90s / early 00s.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 11:04             ` Haozhong Zhang
  2016-01-20 11:20               ` Jan Beulich
@ 2016-01-20 15:07               ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 15:07 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Ian Campbell, Wei Liu, Ian Jackson,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, xen-devel,
	Keir Fraser, Xiao Guangrong

On Wed, Jan 20, 2016 at 07:04:49PM +0800, Haozhong Zhang wrote:
> On 01/20/16 01:46, Jan Beulich wrote:
> > >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> > > The primary reason of current solution is to reuse existing NVDIMM
> > > driver in Linux kernel.
> >
> 
> CC'ing QEMU vNVDIMM maintainer: Xiao Guangrong
> 
> > Re-using code in the Dom0 kernel has benefits and drawbacks, and
> > in any event needs to depend on proper layering to remain in place.
> > A benefit is less code duplication between Xen and Linux; along the
> > same lines a drawback is code duplication between various Dom0
> > OS variants.
> >
> 
> Not clear about other Dom0 OS. But for Linux, it already has a NVDIMM
> driver since 4.2.
> 
> > > One responsibility of this driver is to discover NVDIMM devices and
> > > their parameters (e.g. which portion of an NVDIMM device can be mapped
> > > into the system address space and which address it is mapped to) by
> > > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> > > ACPI Specification v6 and the actual code in Linux kernel
> > > (drivers/acpi/nfit.*), it's not a trivial task.
> > 
> > To answer one of Kevin's questions: The NFIT table doesn't appear
> > to require the ACPI interpreter. They seem more like SRAT and SLIT.
> 
> Sorry, I made a mistake in another reply. NFIT does not contain
> anything requiring ACPI interpreter. But there are some _DSM methods
> for NVDIMM in SSDT, which needs ACPI interpreter.

Right, but those are for health checks and such. Not needed for boot-time
discovery of the ranges in memory of the NVDIMM.
> 
> > Also you failed to answer Kevin's question regarding E820 entries: I
> > think NVDIMM (or at least parts thereof) get represented in E820 (or
> > the EFI memory map), and if that's the case this would be a very
> > strong hint towards management needing to be in the hypervisor.
> >
> 
> Legacy NVDIMM devices may use E820 entries or other ad-hoc ways to
> announce their locations, but newer ones that follow ACPI v6 spec do
> not need E820 any more and only need ACPI NFIT (i.e. firmware may not
> build E820 entries for them).

I am missing something here.

Linux pvops uses a hypercall to construct its E820 (XENMEM_machine_memory_map);
see arch/x86/xen/setup.c:xen_memory_setup.

That hypercall gets a filtered E820 from the hypervisor. And the
hypervisor gets the E820 from multiboot2 - which gets it from grub2.

With the 'legacy NVDIMM' using E820_NVDIMM (type 12? 13) - they don't
show up in multiboot2 - which means Xen will ignore them (not sure
if it changes them to E820_RSRV or just leaves them alone).
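
(For reference, the E820 type values in question, as defined in Linux's
arch/x86/include/uapi/asm/e820.h - quoted from memory, so please
double-check:)

#define E820_PMEM    7   /* ACPI 6.0 persistent memory range */
#define E820_PRAM   12   /* legacy/non-standard persistent RAM ("E820_NVDIMM" above) */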

Anyhow, for the /dev/pmem0 driver in Linux to construct a block
device on the E820_NVDIMM range - it MUST have the E820 entry - but we
don't construct that.

I would think that one of the patches would be for the hypervisor
to recognize the E820_NVDIMM and associate that area with p2m_mmio
(so that the xc_memory_mapping hypercall would work on the MFNs)?

But you also mention ACPI v6 defining them as using ACPI NFIT -
so that would be treating the system address extracted from the
ACPI NFIT just as MMIO (except being WB instead of UC).

Either way - the Xen hypervisor should also parse the ACPI NFIT so
that it can mark that range as p2m_mmio (or does it do that by
default for any non-E820 ranges?). Does it actually need to
do that? Or is that optional?

I hope the design document will explain a bit of this.

> 
> The current linux kernel can handle both legacy and new NVDIMM devices
> and provide the same block device interface for them.

OK, so Xen would need to do that as well - so that the Linux kernel
can utilize it.
> 
> > > Secondly, the driver implements a convenient block device interface to
> > > let software access areas where NVDIMM devices are mapped. The
> > > existing vNVDIMM implementation in QEMU uses this interface.
> > > 
> > > As Linux NVDIMM driver has already done above, why do we bother to
> > > reimplement them in Xen?
> > 
> > See above; a possibility is that we may need a split model (block
> > layer parts on Dom0, "normal memory" parts in the hypervisor.
> > Iirc the split is being determined by firmware, and hence set in
> > stone by the time OS (or hypervisor) boot starts.
> >
> 
> For the "normal memory" parts, do you mean parts that map the host
> NVDIMM device's address space range to the guest? I'm going to
> implement that part in hypervisor and expose it as a hypercall so that
> it can be used by QEMU.
> 
> Haozhong
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 14:53                         ` Haozhong Zhang
@ 2016-01-20 15:13                           ` Konrad Rzeszutek Wilk
  2016-01-20 15:29                             ` Haozhong Zhang
  0 siblings, 1 reply; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 15:13 UTC (permalink / raw)
  To: Andrew Cooper, Stefano Stabellini, Kevin Tian, Wei Liu,
	Ian Campbell, Jun Nakajima, Ian Jackson, xen-devel, Jan Beulich,
	Xiao Guangrong, Keir Fraser, bob.liu

On Wed, Jan 20, 2016 at 10:53:10PM +0800, Haozhong Zhang wrote:
> On 01/20/16 14:45, Andrew Cooper wrote:
> > On 20/01/16 14:29, Stefano Stabellini wrote:
> > > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> > >> On 20/01/16 10:36, Xiao Guangrong wrote:
> > >>> Hi,
> > >>>
> > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> > >>>
> > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> > >>>>
> > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> > >>>>> mapped
> > >>>>> into memory.  I am still on the dom0 side of this fence.
> > >>>>>
> > >>>>> The real question is whether it is possible to take an NVDIMM, split it
> > >>>>> in half, give each half to two different guests (with appropriate NFIT
> > >>>>> tables) and that be sufficient for the guests to just work.
> > >>>>>
> > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> > >>>> tables for each part.
> > >>>>
> > >>>>> Either way, it needs to be a toolstack policy decision as to how to
> > >>>>> split the resource.
> > >>> Currently, we are using NVDIMM as a block device and a DAX-based
> > >>> filesystem
> > >>> is created upon it in Linux so that file-related accesses directly reach
> > >>> the NVDIMM device.
> > >>>
> > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > >>> create multiple files on the DAX-based filesystem and assign the file to
> > >>> each VMs. In the future, we can enable namespace (partition-like) for
> > >>> PMEM
> > >>> memory and assign the namespace to each VMs (current Linux driver uses
> > >>> the
> > >>> whole PMEM as a single namespace).
> > >>>
> > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > >>> device
> > >>> and manager NVDIMM resource.
> > >>>
> > >>> Thanks!
> > >>>
> > >> The more I see about this, the more sure I am that we want to keep it as
> > >> a block device managed by dom0.
> > >>
> > >> In the case of the DAX-based filesystem, I presume files are not
> > >> necessarily contiguous.  I also presume that this is worked around by
> > >> permuting the mapping of the virtual NVDIMM such that the it appears as
> > >> a contiguous block of addresses to the guest?
> > >>
> > >> Today in Xen, Qemu already has the ability to create mappings in the
> > >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> > >> conceptual difference here, although the security/permission model
> > >> certainly is more complicated.
> > > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > > privileges, does it not?
> > 
> > I presume it does, although mmap()ing a file on a DAX filesystem will
> > work in the standard POSIX way.
> > 
> > Neither of these are sufficient however.  That gets Qemu a mapping of
> > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > this into appropriate add-to-phymap hypercalls.
> >
> 
> Yes, those hypercalls are what I'm going to add.

Why?

What you need (in a rough hand-wave way) is to:
 - mount /dev/pmem0
 - mmap the file on the /dev/pmem0 FS
 - walk the VMA for the file - extract the MFNs (machine frame numbers)
 - feed those frame numbers to the xc_memory_mapping hypercall. The
   guest pfns would be contiguous.
   Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
   the /dev/pmem0 FS - the guest pfns are 0x200000 upward.

   However the MFNs may be discontiguous as the NVDIMM could be
   1TB - and the 8GB file is scattered all over.
I believe that is all you would need to do?
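
(A rough sketch of the first two steps above; the helper name is made up
for illustration, and the file is assumed to already exist on a mounted
/dev/pmem0 DAX filesystem:)

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap a file living on the DAX filesystem; with DAX there is no page
 * cache in between, so the mapping is backed directly by NVDIMM pages. */
static void *map_pmem_file(const char *path, size_t *len)
{
    struct stat st;
    void *va;
    int fd = open(path, O_RDWR);       /* e.g. /mnt/pmem0/guest1.img */

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *len = st.st_size;
    va = mmap(NULL, *len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return va == MAP_FAILED ? NULL : va;
}

/* The remaining steps - walking this mapping to collect the backing
 * frame numbers and feeding them to the mapping hypercall - are the
 * open question discussed below (see the pagemap sketch later in the
 * thread). */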
> 
> Haozhong
> 
> > >
> > > I wouldn't encourage the introduction of anything else that requires
> > > root privileges in QEMU. With QEMU running as non-root by default in
> > > 4.7, the feature will not be available unless users explicitly ask to
> > > run QEMU as root (which they shouldn't really).
> > 
> > This isn't how design works.
> > 
> > First, design a feature in an architecturally correct way, and then
> > design an security policy to fit.  (note, both before implement happens).
> > 
> > We should not stunt design based on an existing implementation.  In
> > particular, if design shows that being a root only feature is the only
> > sane way of doing this, it should be a root only feature.  (I hope this
> > is not the case, but it shouldn't cloud the judgement of a design).
> > 
> > ~Andrew
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:13                           ` Konrad Rzeszutek Wilk
@ 2016-01-20 15:29                             ` Haozhong Zhang
  2016-01-20 15:41                               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 15:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

On 01/20/16 10:13, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 20, 2016 at 10:53:10PM +0800, Haozhong Zhang wrote:
> > On 01/20/16 14:45, Andrew Cooper wrote:
> > > On 20/01/16 14:29, Stefano Stabellini wrote:
> > > > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> > > >> On 20/01/16 10:36, Xiao Guangrong wrote:
> > > >>> Hi,
> > > >>>
> > > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> > > >>>
> > > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> > > >>>>
> > > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> > > >>>>> mapped
> > > >>>>> into memory.  I am still on the dom0 side of this fence.
> > > >>>>>
> > > >>>>> The real question is whether it is possible to take an NVDIMM, split it
> > > >>>>> in half, give each half to two different guests (with appropriate NFIT
> > > >>>>> tables) and that be sufficient for the guests to just work.
> > > >>>>>
> > > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> > > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> > > >>>> tables for each part.
> > > >>>>
> > > >>>>> Either way, it needs to be a toolstack policy decision as to how to
> > > >>>>> split the resource.
> > > >>> Currently, we are using NVDIMM as a block device and a DAX-based
> > > >>> filesystem
> > > >>> is created upon it in Linux so that file-related accesses directly reach
> > > >>> the NVDIMM device.
> > > >>>
> > > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > > >>> create multiple files on the DAX-based filesystem and assign the file to
> > > >>> each VMs. In the future, we can enable namespace (partition-like) for
> > > >>> PMEM
> > > >>> memory and assign the namespace to each VMs (current Linux driver uses
> > > >>> the
> > > >>> whole PMEM as a single namespace).
> > > >>>
> > > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > > >>> device
> > > >>> and manager NVDIMM resource.
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >> The more I see about this, the more sure I am that we want to keep it as
> > > >> a block device managed by dom0.
> > > >>
> > > >> In the case of the DAX-based filesystem, I presume files are not
> > > >> necessarily contiguous.  I also presume that this is worked around by
> > > >> permuting the mapping of the virtual NVDIMM such that the it appears as
> > > >> a contiguous block of addresses to the guest?
> > > >>
> > > >> Today in Xen, Qemu already has the ability to create mappings in the
> > > >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> > > >> conceptual difference here, although the security/permission model
> > > >> certainly is more complicated.
> > > > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > > > privileges, does it not?
> > > 
> > > I presume it does, although mmap()ing a file on a DAX filesystem will
> > > work in the standard POSIX way.
> > > 
> > > Neither of these are sufficient however.  That gets Qemu a mapping of
> > > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > > this into appropriate add-to-phymap hypercalls.
> > >
> > 
> > Yes, those hypercalls are what I'm going to add.
> 
> Why?
> 
> What you need (in a rought hand-wave way) is to:
>  - mount /dev/pmem0
>  - mmap the file on /dev/pmem0 FS
>  - walk the VMA for the file - extract the MFN (machien frame numbers)

Can this step be done by QEMU? Or does the Linux kernel provide some
way for userspace to do the translation?

Haozhong

>  - feed those frame numbers to xc_memory_mapping hypercall. The
>    guest pfns would be contingous.
>    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
>    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
> 
>    However the MFNs may be discontingous as the NVDIMM could be an
>    1TB - and the 8GB file is scattered all over.
> 
> I believe that is all you would need to do?
> > 
> > Haozhong
> > 
> > > >
> > > > I wouldn't encourage the introduction of anything else that requires
> > > > root privileges in QEMU. With QEMU running as non-root by default in
> > > > 4.7, the feature will not be available unless users explicitly ask to
> > > > run QEMU as root (which they shouldn't really).
> > > 
> > > This isn't how design works.
> > > 
> > > First, design a feature in an architecturally correct way, and then
> > > design an security policy to fit.  (note, both before implement happens).
> > > 
> > > We should not stunt design based on an existing implementation.  In
> > > particular, if design shows that being a root only feature is the only
> > > sane way of doing this, it should be a root only feature.  (I hope this
> > > is not the case, but it shouldn't cloud the judgement of a design).
> > > 
> > > ~Andrew
> > > 
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@lists.xen.org
> > > http://lists.xen.org/xen-devel
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 11:20               ` Jan Beulich
@ 2016-01-20 15:29                 ` Xiao Guangrong
  2016-01-20 15:47                   ` Konrad Rzeszutek Wilk
  2016-01-20 17:07                   ` Jan Beulich
  0 siblings, 2 replies; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-20 15:29 UTC (permalink / raw)
  To: Jan Beulich, Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Keir Fraser



On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
>> On 01/20/16 01:46, Jan Beulich wrote:
>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>>>> Secondly, the driver implements a convenient block device interface to
>>>> let software access areas where NVDIMM devices are mapped. The
>>>> existing vNVDIMM implementation in QEMU uses this interface.
>>>>
>>>> As Linux NVDIMM driver has already done above, why do we bother to
>>>> reimplement them in Xen?
>>>
>>> See above; a possibility is that we may need a split model (block
>>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>>> Iirc the split is being determined by firmware, and hence set in
>>> stone by the time OS (or hypervisor) boot starts.
>>
>> For the "normal memory" parts, do you mean parts that map the host
>> NVDIMM device's address space range to the guest? I'm going to
>> implement that part in hypervisor and expose it as a hypercall so that
>> it can be used by QEMU.
>
> To answer this I need to have my understanding of the partitioning
> being done by firmware confirmed: If that's the case, then "normal"
> means the part that doesn't get exposed as a block device (SSD).
> In any event there's no correlation to guest exposure here.

Firmware does not manage the NVDIMM. All NVDIMM operations are handled
by the OS.

Actually, there are lots of things we should take into account if we move
the NVDIMM management into the hypervisor:
a) ACPI NFIT interpretation
    NFIT is a new ACPI table introduced in ACPI 6.0. It exports the base
    information of NVDIMM devices, which includes PMEM info, PBLK info,
    NVDIMM device interleaving, vendor info, etc. Let me explain them one
    by one (a sketch of the NFIT SPA range structure follows this list).

    PMEM and PBLK are two modes to access NVDIMM devices:
    1) PMEM can be treated as NV-RAM which is directly mapped into the CPU's
       address space so that the CPU can read/write it directly.
    2) as an NVDIMM has huge capacity and the CPU's address space is limited,
       the NVDIMM only offers two windows which are mapped into the CPU's
       address space, the data window and the access window, so that the CPU
       can use these two windows to access the whole NVDIMM device.

    NVDIMM devices are interleaved, and the interleave info is also exported
    so that we can calculate the address used to access a specific NVDIMM
    device.

    NVDIMM devices from different vendors can have different functions, so
    the vendor info is exported by NFIT to make the vendor's driver work.

b) ACPI SSDT interpretation
    The SSDT offers _DSM methods which control the NVDIMM device, such as
    label operations, health checks, etc., and hotplug support.

c) Resource management
    NVDIMM resource management is challenging because:
    1) PMEM is huge and slightly slower to access than RAM, so it is not
       suitable to manage it as page structs (I think this is not a big
       problem in the Xen hypervisor?)
    2) we need to partition it so it can be used by multiple VMs.
    3) we need to support PBLK and partition it in the future.

d) management tools support
    S.M.A.R.T.? error detection and recovery?

e) hotplug support

f) third-party drivers
    Vendor drivers need to be ported to the Xen hypervisor and be supported
    in the management tool.

g) ...
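
(Sketch of the NFIT System Physical Address Range structure mentioned in
(a), following ACPI 6.0 sec. 5.2.25.2 and Linux's struct
acpi_nfit_system_address; the type and field names here are illustrative,
not an existing header:)

#include <stdint.h>

/* One SPA range entry describes where a PMEM region (or a PBLK
 * control/data window) sits in the system physical address space. */
struct nfit_spa_range {
    uint16_t type;                /* 0 = SPA Range Structure */
    uint16_t length;              /* length of this structure */
    uint16_t range_index;         /* referenced by NVDIMM region entries */
    uint16_t flags;
    uint32_t reserved;
    uint32_t proximity_domain;    /* NUMA node of the range */
    uint8_t  range_type_guid[16]; /* e.g. the persistent-memory GUID */
    uint64_t spa_base;            /* system physical address base */
    uint64_t spa_length;          /* size of the range in bytes */
    uint64_t memory_mapping_attr; /* cacheability attributes (WB, ...) */
} __attribute__((packed));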

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:29                             ` Haozhong Zhang
@ 2016-01-20 15:41                               ` Konrad Rzeszutek Wilk
  2016-01-20 15:54                                 ` Haozhong Zhang
  2016-01-21  3:35                                 ` Bob Liu
  0 siblings, 2 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 15:41 UTC (permalink / raw)
  To: Andrew Cooper, Stefano Stabellini, Kevin Tian, Wei Liu,
	Ian Campbell, Jun Nakajima, Ian Jackson, xen-devel, Jan Beulich,
	Xiao Guangrong, Keir Fraser, bob.liu

> > > > Neither of these are sufficient however.  That gets Qemu a mapping of
> > > > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > > > this into appropriate add-to-phymap hypercalls.
> > > >
> > > 
> > > Yes, those hypercalls are what I'm going to add.
> > 
> > Why?
> > 
> > What you need (in a rought hand-wave way) is to:
> >  - mount /dev/pmem0
> >  - mmap the file on /dev/pmem0 FS
> >  - walk the VMA for the file - extract the MFN (machien frame numbers)
> 
> Can this step be done by QEMU? Or does linux kernel provide some
> approach for the userspace to do the translation?

I don't know. I would think no - as you wouldn't want a userspace
application to figure out the physical frames from the virtual
address (unless it is root). But then if you look in
/proc/<pid>/maps and /proc/<pid>/smaps there is some data there.

Hm, /proc/<pid>/pagemap has something interesting.

See the pagemap_read function. That looks to be doing it?
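
(A minimal sketch of what pagemap_read exposes to userspace, i.e. how a
privileged process could turn a virtual address into the frame number
backing it; the helper is illustrative, entry format per
Documentation/vm/pagemap.txt.  Note it needs CAP_SYS_ADMIN to see real
frame numbers, and on a PV dom0 the result is, I think, a pseudo-physical
rather than a machine frame:)

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static int vaddr_to_frame(const void *vaddr, uint64_t *frame)
{
    uint64_t entry;
    long page_size = sysconf(_SC_PAGESIZE);
    off_t offset = ((uintptr_t)vaddr / page_size) * sizeof(entry);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return -1;
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        close(fd);
        return -1;
    }
    close(fd);

    if (!(entry & (1ULL << 63)))          /* bit 63: page present */
        return -1;
    *frame = entry & ((1ULL << 55) - 1);  /* bits 0-54: frame number */
    return 0;
}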

> 
> Haozhong
> 
> >  - feed those frame numbers to xc_memory_mapping hypercall. The
> >    guest pfns would be contingous.
> >    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
> >    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
> > 
> >    However the MFNs may be discontingous as the NVDIMM could be an
> >    1TB - and the 8GB file is scattered all over.
> > 
> > I believe that is all you would need to do?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:29                 ` Xiao Guangrong
@ 2016-01-20 15:47                   ` Konrad Rzeszutek Wilk
  2016-01-20 16:25                     ` Xiao Guangrong
  2016-01-20 17:07                   ` Jan Beulich
  1 sibling, 1 reply; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 15:47 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Keir Fraser

On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
> 
> 
> On 01/20/2016 07:20 PM, Jan Beulich wrote:
> >>>>On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
> >>On 01/20/16 01:46, Jan Beulich wrote:
> >>>>>>On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> >>>>Secondly, the driver implements a convenient block device interface to
> >>>>let software access areas where NVDIMM devices are mapped. The
> >>>>existing vNVDIMM implementation in QEMU uses this interface.
> >>>>
> >>>>As Linux NVDIMM driver has already done above, why do we bother to
> >>>>reimplement them in Xen?
> >>>
> >>>See above; a possibility is that we may need a split model (block
> >>>layer parts on Dom0, "normal memory" parts in the hypervisor.
> >>>Iirc the split is being determined by firmware, and hence set in
> >>>stone by the time OS (or hypervisor) boot starts.
> >>
> >>For the "normal memory" parts, do you mean parts that map the host
> >>NVDIMM device's address space range to the guest? I'm going to
> >>implement that part in hypervisor and expose it as a hypercall so that
> >>it can be used by QEMU.
> >
> >To answer this I need to have my understanding of the partitioning
> >being done by firmware confirmed: If that's the case, then "normal"
> >means the part that doesn't get exposed as a block device (SSD).
> >In any event there's no correlation to guest exposure here.
> 
> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
> by OS.
> 
> Actually, there are lots of things we should take into account if we move
> the NVDIMM management to hypervisor:

If you remove the block device part and just deal with the pmem part
then this gets smaller.

Also the _DSM operations - I can't see them being in the hypervisor - but only
in dom0 - which would have the right software to tickle the correct
ioctl on /dev/pmem to do the "management" (carve up the NVDIMM, perform
a SMART operation, etc).

> a) ACPI NFIT interpretation
>    A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>    base information of NVDIMM devices which includes PMEM info, PBLK
>    info, nvdimm device interleave, vendor info, etc. Let me explain it one
>    by one.

And it is a static table - like the MADT.
> 
>    PMEM and PBLK are two modes to access NVDIMM devices:
>    1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>       space so that CPU can r/w it directly.
>    2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>       only offers two windows which are mapped to CPU's address space, the data
>       window and access window, so that CPU can use these two windows to access
>       the whole NVDIMM device.
> 
>    NVDIMM device is interleaved whose info is also exported so that we can
>    calculate the address to access the specified NVDIMM device.

Right, along with the serial numbers.
> 
>    NVDIMM devices from different vendor can have different function so that the
>    vendor info is exported by NFIT to make vendor's driver work.

via _DSM right?
> 
> b) ACPI SSDT interpretation
>    SSDT offers _DSM method which controls NVDIMM device, such as label operation,
>    health check etc and hotplug support.

Sounds like the control domain (dom0) would be in charge of that.
> 
> c) Resource management
>    NVDIMM resource management challenged as:
>    1) PMEM is huge and it is little slower access than RAM so it is not suitable
>       to manage it as page struct (i think it is not a big problem in Xen
>       hypervisor?)
>    2) need to partition it to it be used in multiple VMs.
>    3) need to support PBLK and partition it in the future.

That all sounds to me like control domain (dom0) decisions, not Xen hypervisor ones.
> 
> d) management tools support
>    S.M.A.R.T? error detection and recovering?
> 
> c) hotplug support

How does that work? Ah, the _DSM will point to the new ACPI NFIT for the OS
to scan. That would require the hypervisor also reading this for it to
update its data structures.
> 
> d) third parts drivers
>    Vendor drivers need to be ported to xen hypervisor and let it be supported in
>    the management tool.

Ewww.

I presume the 'third party drivers' means more interesting _DSM features, right?
At the base level the firmware with this type of NVDIMM would still have
the basics - ACPI NFIT + E820_NVDIMM (optional).
> 
> e) ...
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:41                               ` Konrad Rzeszutek Wilk
@ 2016-01-20 15:54                                 ` Haozhong Zhang
  2016-01-21  3:35                                 ` Bob Liu
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 15:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

On 01/20/16 10:41, Konrad Rzeszutek Wilk wrote:
> > > > > Neither of these are sufficient however.  That gets Qemu a mapping of
> > > > > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > > > > this into appropriate add-to-phymap hypercalls.
> > > > >
> > > > 
> > > > Yes, those hypercalls are what I'm going to add.
> > > 
> > > Why?
> > > 
> > > What you need (in a rough hand-wave way) is to:
> > >  - mount /dev/pmem0
> > >  - mmap the file on /dev/pmem0 FS
> > >  - walk the VMA for the file - extract the MFNs (machine frame numbers)
> > 
> > Can this step be done by QEMU? Or does linux kernel provide some
> > approach for the userspace to do the translation?
> 
> I don't know. I would think no - as you wouldn't want the userspace
> application to figure out the physical frames from the virtual
> address (unless they are root). But then if you look in
> /proc/<pid>/maps and /proc/<pid>/smaps there are some data there.
> 
> Hm, /proc/<pid>/pagemap has something interesting
> 
> See pagemap_read function. That looks to be doing it?
>

Interesting and good to know this. I'll have a look at it.

Thanks,
Haozhong
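
For reference, a minimal C sketch (not from any posted patch) of the
/proc/<pid>/pagemap lookup mentioned above. It assumes a privileged caller
(the kernel zeroes the PFN field for unprivileged readers) and a page that
has already been faulted in; bits 0-54 of each 64-bit entry hold the frame
number and bit 63 is the "present" flag:

/* Hedged sketch: translate a virtual address of an mmap'ed /dev/pmem0-backed
 * file into the frame number reported by /proc/self/pagemap (the interface
 * implemented by pagemap_read() in the kernel). */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SHIFT   12
#define PM_PFN_MASK  ((1ULL << 55) - 1)   /* bits 0-54: page frame number */
#define PM_PRESENT   (1ULL << 63)         /* bit 63: page is present */

static uint64_t virt_to_frame(const void *vaddr)
{
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    off_t off = ((uintptr_t)vaddr >> PAGE_SHIFT) * sizeof(entry);

    if (fd < 0)
        return 0;
    if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
        entry = 0;
    close(fd);

    return (entry & PM_PRESENT) ? (entry & PM_PFN_MASK) : 0;
}

Whether the numbers reported this way under a PV dom0 are true machine frames
or dom0 pseudo-physical frames needing a further translation is one of the
details that would still need checking.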

> > 
> > Haozhong
> > 
> > >  - feed those frame numbers to xc_memory_mapping hypercall. The
> > >    guest pfns would be contiguous.
> > >    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
> > >    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
> > > 
> > >    However the MFNs may be discontiguous as the NVDIMM could be a
> > >    1TB device - and the 8GB file is scattered all over.
> > > 
> > > I believe that is all you would need to do?
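
For illustration, a hedged sketch of the last quoted step, assuming the call
referred to above as "xc_memory_mapping" is libxc's xc_domain_memory_mapping()
(the interface otherwise used for MMIO pass-through); domid, guest pfn base
and MFN extent are placeholders:

/* Hedged sketch: map one physically-contiguous extent of a pmem-backed file
 * into the guest's physmap.  A scattered 8GB file would need one call per
 * contiguous run of MFNs. */
#include <xenctrl.h>

static int map_pmem_extent(xc_interface *xch, uint32_t domid,
                           unsigned long gpfn, unsigned long mfn,
                           unsigned long nr_frames)
{
    /* DPCI_ADD_MAPPING adds the mappings; DPCI_REMOVE_MAPPING undoes them. */
    return xc_domain_memory_mapping(xch, domid, gpfn, mfn, nr_frames,
                                    DPCI_ADD_MAPPING);
}

Whether this existing MMIO-oriented interface is actually suitable for pmem
(cacheability, p2m type, and the VT-d questions raised elsewhere in the
thread) is exactly what is being debated.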

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/4] add support for vNVDIMM
  2016-01-20 14:54           ` Andrew Cooper
@ 2016-01-20 15:59             ` Haozhong Zhang
  0 siblings, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-20 15:59 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Nakajima, Jun, Ian Jackson, Xiao Guangrong, xen-devel,
	Jan Beulich, Wei Liu

On 01/20/16 14:54, Andrew Cooper wrote:
> On 20/01/16 14:47, Zhang, Haozhong wrote:
> > On 01/20/16 14:35, Stefano Stabellini wrote:
> >> On Wed, 20 Jan 2016, Zhang, Haozhong wrote:
> >>> On 01/20/16 12:43, Stefano Stabellini wrote:
> >>>> On Wed, 20 Jan 2016, Tian, Kevin wrote:
> >>>>>> From: Zhang, Haozhong
> >>>>>> Sent: Tuesday, December 29, 2015 7:32 PM
> >>>>>>
> >>>>>> This patch series is the Xen part patch to provide virtual NVDIMM to
> >>>>>> guest. The corresponding QEMU patch series is sent separately with the
> >>>>>> title "[PATCH 0/2] add vNVDIMM support for Xen".
> >>>>>>
> >>>>>> * Background
> >>>>>>
> >>>>>>  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> >>>>>>  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> >>>>>>  and configured by _DSM method of NVDIMM device in ACPI. Some
> >>>>>>  documents can be found at
> >>>>>>  [1] ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> >>>>>>  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> >>>>>>  [3] DSM Interface Example:
> >>>>>> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> >>>>>>  [4] Driver Writer's Guide:
> >>>>>> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> >>>>>>
> >>>>>>  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> >>>>>>  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> >>>>>>  mapped into CPU's address space and are accessed via normal memory
> >>>>>>  read/write and three special instructions (clflushopt/clwb/pcommit).
> >>>>>>
> >>>>>>  This patch series and the corresponding QEMU patch series enable Xen
> >>>>>>  to provide vNVDIMM devices to HVM domains.
> >>>>>>
> >>>>>> * Design
> >>>>>>
> >>>>>>  Supporting vNVDIMM in PMEM mode has three requirements.
> >>>>>>
> >>>>> Although this design is about vNVDIMM, some background of how pNVDIMM
> >>>>> is managed in Xen would be helpful to understand the whole design since
> >>>>> in PMEM mode you need map pNVDIMM into GFN addr space so there's
> >>>>> a matter of how pNVDIMM is allocated.
> >>>> Yes, some background would be very helpful. Given that there are so many
> >>>> moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
> >>>> I suggest that we start with a design document for this feature.
> >>> Let me prepare a design document. Basically, it would include
> >>> following contents. Please let me know if you want anything additional
> >>> to be included.
> >> Thank you!
> >>
> >>
> >>> * What NVDIMM is and how it is used
> >>> * Software interface of NVDIMM
> >>>   - ACPI NFIT: what parameters are recorded and their usage
> >>>   - ACPI SSDT: what _DSM methods are provided and their functionality
> >>>   - New instructions: clflushopt/clwb/pcommit
> >>> * How the linux kernel drives NVDIMM
> >>>   - ACPI parsing
> >>>   - Block device interface
> >>>   - Partition NVDIMM devices
> >>> * How KVM/QEMU implements vNVDIMM
> >> This is a very good start.
> >>
> >>
> >>> * What I propose to implement vNVDIMM in Xen
> >>>   - Xen hypervisor/toolstack: new instruction enabling and address mapping
> >>>   - Dom0 Linux kernel: host NVDIMM driver
> >>>   - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping
> >> This is OK. It might be also good to list other options that were
> >> discussed, but it is certainly not necessary in first instance.
> > I'll include them.
> >
> > And one thing missed above:
> > * What I propose to implement vNVDIMM in Xen
> >   - Building vNFIT and vSSDT: copy them from QEMU to Xen toolstack
> >
> > I know it is controversial and will record other options and my reason
> > for this choice.
> 
> Please would you split the subjects of "how to architect guest NVDIMM
> support in Xen" from "how to get suitable ACPI tables into a guest". 
> The former depends on the latter, but they are two different problems to
> solve and shouldn't be conflated in one issue.
> 
> ~Andrew
>

Sure.

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:47                   ` Konrad Rzeszutek Wilk
@ 2016-01-20 16:25                     ` Xiao Guangrong
  2016-01-20 16:47                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-20 16:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Jun Nakajima, Keir Fraser



On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
>>>> On 01/20/16 01:46, Jan Beulich wrote:
>>>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>>>>>> Secondly, the driver implements a convenient block device interface to
>>>>>> let software access areas where NVDIMM devices are mapped. The
>>>>>> existing vNVDIMM implementation in QEMU uses this interface.
>>>>>>
>>>>>> As Linux NVDIMM driver has already done above, why do we bother to
>>>>>> reimplement them in Xen?
>>>>>
>>>>> See above; a possibility is that we may need a split model (block
>>>>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>>>>> Iirc the split is being determined by firmware, and hence set in
>>>>> stone by the time OS (or hypervisor) boot starts.
>>>>
>>>> For the "normal memory" parts, do you mean parts that map the host
>>>> NVDIMM device's address space range to the guest? I'm going to
>>>> implement that part in hypervisor and expose it as a hypercall so that
>>>> it can be used by QEMU.
>>>
>>> To answer this I need to have my understanding of the partitioning
>>> being done by firmware confirmed: If that's the case, then "normal"
>>> means the part that doesn't get exposed as a block device (SSD).
>>> In any event there's no correlation to guest exposure here.
>>
>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>> by OS.
>>
>> Actually, there are lots of things we should take into account if we move
>> the NVDIMM management to hypervisor:
>
> If you remove the block device part and just deal with pmem part then this
> gets smaller.
>

Yes indeed. But Xen cannot benefit from NVDIMM BLK; I don't think it will be in
the plan for a long time. :)

> Also the _DSM operations - I can't see them being in hypervisor - but only
> in the dom0 - which would have the right software to tickle the correct
> ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform
> an SMART operation, etc).

Yes, it is reasonable to put it in dom 0 and it makes management tools happy.

>
>> a) ACPI NFIT interpretation
>>     A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>     base information of NVDIMM devices which includes PMEM info, PBLK
>>     info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>     by one.
>
> And it is a static table. As in part of the MADT.

Yes, it is, but we need to fetch updated NVDIMM info from _FIT in the SSDT/DSDT instead
if an NVDIMM device is hotplugged; please see below.

>>
>>     PMEM and PBLK are two modes to access NVDIMM devices:
>>     1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>        space so that CPU can r/w it directly.
>>     2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>        only offers two windows which are mapped to CPU's address space, the data
>>        window and access window, so that CPU can use these two windows to access
>>        the whole NVDIMM device.
>>
>>     NVDIMM device is interleaved whose info is also exported so that we can
>>     calculate the address to access the specified NVDIMM device.
>
> Right, along with the serial numbers.
>>
>>     NVDIMM devices from different vendor can have different function so that the
>>     vendor info is exported by NFIT to make vendor's driver work.
>
> via _DSM right?

Yes.

>>
>> b) ACPI SSDT interpretation
>>     SSDT offers _DSM method which controls NVDIMM device, such as label operation,
>>     health check etc and hotplug support.
>
> Sounds like the control domain (dom0) would be in charge of that.

Yup. Dom0 is a better place to handle it.

>>
>> c) Resource management
>>     NVDIMM resource management challenged as:
>>     1) PMEM is huge and it is little slower access than RAM so it is not suitable
>>        to manage it as page struct (i think it is not a big problem in Xen
>>        hypervisor?)
>>     2) need to partition it to it be used in multiple VMs.
>>     3) need to support PBLK and partition it in the future.
>
> That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor.

Sure, so letting dom0 handle this is better; we are on the same page. :)

>>
>> d) management tools support
>>     S.M.A.R.T? error detection and recovering?
>>
>> c) hotplug support
>
> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> to scan. That would require the hypervisor also reading this for it to
> update it's data-structures.

Similar to what you said. The NVDIMM root device in the SSDT/DSDT provides a new interface,
_FIT, which returns a new NFIT once a new device is hotplugged. And yes, domain 0 is
the better place to handle this case too.

>>
>> d) third parts drivers
>>     Vendor drivers need to be ported to xen hypervisor and let it be supported in
>>     the management tool.
>
> Ewww.
>
> I presume the 'third party drivers' mean more interesting _DSM features right?

Yes.

> On the base level the firmware with this type of NVDIMM would still have
> the basic - ACPI NFIT + E820_NVDIMM (optional).
>>

Yes.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 16:25                     ` Xiao Guangrong
@ 2016-01-20 16:47                       ` Konrad Rzeszutek Wilk
  2016-01-20 16:55                         ` Xiao Guangrong
  0 siblings, 1 reply; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 16:47 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Jun Nakajima, Keir Fraser

On Thu, Jan 21, 2016 at 12:25:08AM +0800, Xiao Guangrong wrote:
> 
> 
> On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote:
> >On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
> >>
> >>
> >>On 01/20/2016 07:20 PM, Jan Beulich wrote:
> >>>>>>On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
> >>>>On 01/20/16 01:46, Jan Beulich wrote:
> >>>>>>>>On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> >>>>>>Secondly, the driver implements a convenient block device interface to
> >>>>>>let software access areas where NVDIMM devices are mapped. The
> >>>>>>existing vNVDIMM implementation in QEMU uses this interface.
> >>>>>>
> >>>>>>As Linux NVDIMM driver has already done above, why do we bother to
> >>>>>>reimplement them in Xen?
> >>>>>
> >>>>>See above; a possibility is that we may need a split model (block
> >>>>>layer parts on Dom0, "normal memory" parts in the hypervisor.
> >>>>>Iirc the split is being determined by firmware, and hence set in
> >>>>>stone by the time OS (or hypervisor) boot starts.
> >>>>
> >>>>For the "normal memory" parts, do you mean parts that map the host
> >>>>NVDIMM device's address space range to the guest? I'm going to
> >>>>implement that part in hypervisor and expose it as a hypercall so that
> >>>>it can be used by QEMU.
> >>>
> >>>To answer this I need to have my understanding of the partitioning
> >>>being done by firmware confirmed: If that's the case, then "normal"
> >>>means the part that doesn't get exposed as a block device (SSD).
> >>>In any event there's no correlation to guest exposure here.
> >>
> >>Firmware does not manage NVDIMM. All the operations of nvdimm are handled
> >>by OS.
> >>
> >>Actually, there are lots of things we should take into account if we move
> >>the NVDIMM management to hypervisor:
> >
> >If you remove the block device part and just deal with pmem part then this
> >gets smaller.
> >
> 
> Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long
> time plan. :)
> 
> >Also the _DSM operations - I can't see them being in hypervisor - but only
> >in the dom0 - which would have the right software to tickle the correct
> >ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform
> >an SMART operation, etc).
> 
> Yes, it is reasonable to put it in dom 0 and it makes management tools happy.
> 
> >
> >>a) ACPI NFIT interpretation
> >>    A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
> >>    base information of NVDIMM devices which includes PMEM info, PBLK
> >>    info, nvdimm device interleave, vendor info, etc. Let me explain it one
> >>    by one.
> >
> >And it is a static table. As in part of the MADT.
> 
> Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead
> if a nvdimm device is hotpluged, please see below.
> 
> >>
> >>    PMEM and PBLK are two modes to access NVDIMM devices:
> >>    1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
> >>       space so that CPU can r/w it directly.
> >>    2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
> >>       only offers two windows which are mapped to CPU's address space, the data
> >>       window and access window, so that CPU can use these two windows to access
> >>       the whole NVDIMM device.
> >>
> >>    NVDIMM device is interleaved whose info is also exported so that we can
> >>    calculate the address to access the specified NVDIMM device.
> >
> >Right, along with the serial numbers.
> >>
> >>    NVDIMM devices from different vendor can have different function so that the
> >>    vendor info is exported by NFIT to make vendor's driver work.
> >
> >via _DSM right?
> 
> Yes.
> 
> >>
> >>b) ACPI SSDT interpretation
> >>    SSDT offers _DSM method which controls NVDIMM device, such as label operation,
> >>    health check etc and hotplug support.
> >
> >Sounds like the control domain (dom0) would be in charge of that.
> 
> Yup. Dom0 is a better place to handle it.
> 
> >>
> >>c) Resource management
> >>    NVDIMM resource management challenged as:
> >>    1) PMEM is huge and it is little slower access than RAM so it is not suitable
> >>       to manage it as page struct (i think it is not a big problem in Xen
> >>       hypervisor?)
> >>    2) need to partition it to it be used in multiple VMs.
> >>    3) need to support PBLK and partition it in the future.
> >
> >That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor.
> 
> Sure, so let dom0 handle this is better, we are on the same page. :)
> 
> >>
> >>d) management tools support
> >>    S.M.A.R.T? error detection and recovering?
> >>
> >>c) hotplug support
> >
> >How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> >to scan. That would require the hypervisor also reading this for it to
> >update it's data-structures.
> 
> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> the better place handing this case too.

That one is a bit difficult. Both the OS and the hypervisor would need to know about
this (I think?) - dom0 since it gets the ACPI event and needs to process it. Then
the hypervisor needs to be told so it can slurp it up.

However I don't know if the hypervisor needs to know all the details of an
NVDIMM - or just the starting and ending ranges so that when a guest is created
and the VT-d is constructed - it can be assured that the ranges are valid.

I am not an expert on the P2M code - but I think that would need to be looked
at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into a guest PFN.

> 
> >>
> >>d) third parts drivers
> >>    Vendor drivers need to be ported to xen hypervisor and let it be supported in
> >>    the management tool.
> >
> >Ewww.
> >
> >I presume the 'third party drivers' mean more interesting _DSM features right?
> 
> Yes.
> 
> >On the base level the firmware with this type of NVDIMM would still have
> >the basic - ACPI NFIT + E820_NVDIMM (optional).
> >>
> 
> Yes.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 16:47                       ` Konrad Rzeszutek Wilk
@ 2016-01-20 16:55                         ` Xiao Guangrong
  2016-01-20 17:18                           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-20 16:55 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Keir Fraser



On 01/21/2016 12:47 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 21, 2016 at 12:25:08AM +0800, Xiao Guangrong wrote:
>>
>>
>> On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>>>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
>>>>>> On 01/20/16 01:46, Jan Beulich wrote:
>>>>>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>>>>>>>> Secondly, the driver implements a convenient block device interface to
>>>>>>>> let software access areas where NVDIMM devices are mapped. The
>>>>>>>> existing vNVDIMM implementation in QEMU uses this interface.
>>>>>>>>
>>>>>>>> As Linux NVDIMM driver has already done above, why do we bother to
>>>>>>>> reimplement them in Xen?
>>>>>>>
>>>>>>> See above; a possibility is that we may need a split model (block
>>>>>>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>>>>>>> Iirc the split is being determined by firmware, and hence set in
>>>>>>> stone by the time OS (or hypervisor) boot starts.
>>>>>>
>>>>>> For the "normal memory" parts, do you mean parts that map the host
>>>>>> NVDIMM device's address space range to the guest? I'm going to
>>>>>> implement that part in hypervisor and expose it as a hypercall so that
>>>>>> it can be used by QEMU.
>>>>>
>>>>> To answer this I need to have my understanding of the partitioning
>>>>> being done by firmware confirmed: If that's the case, then "normal"
>>>>> means the part that doesn't get exposed as a block device (SSD).
>>>>> In any event there's no correlation to guest exposure here.
>>>>
>>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>>>> by OS.
>>>>
>>>> Actually, there are lots of things we should take into account if we move
>>>> the NVDIMM management to hypervisor:
>>>
>>> If you remove the block device part and just deal with pmem part then this
>>> gets smaller.
>>>
>>
>> Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long
>> time plan. :)
>>
>>> Also the _DSM operations - I can't see them being in hypervisor - but only
>>> in the dom0 - which would have the right software to tickle the correct
>>> ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform
>>> an SMART operation, etc).
>>
>> Yes, it is reasonable to put it in dom 0 and it makes management tools happy.
>>
>>>
>>>> a) ACPI NFIT interpretation
>>>>     A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>>>     base information of NVDIMM devices which includes PMEM info, PBLK
>>>>     info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>>>     by one.
>>>
>>> And it is a static table. As in part of the MADT.
>>
>> Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead
>> if a nvdimm device is hotpluged, please see below.
>>
>>>>
>>>>     PMEM and PBLK are two modes to access NVDIMM devices:
>>>>     1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>>>        space so that CPU can r/w it directly.
>>>>     2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>>>        only offers two windows which are mapped to CPU's address space, the data
>>>>        window and access window, so that CPU can use these two windows to access
>>>>        the whole NVDIMM device.
>>>>
>>>>     NVDIMM device is interleaved whose info is also exported so that we can
>>>>     calculate the address to access the specified NVDIMM device.
>>>
>>> Right, along with the serial numbers.
>>>>
>>>>     NVDIMM devices from different vendor can have different function so that the
>>>>     vendor info is exported by NFIT to make vendor's driver work.
>>>
>>> via _DSM right?
>>
>> Yes.
>>
>>>>
>>>> b) ACPI SSDT interpretation
>>>>     SSDT offers _DSM method which controls NVDIMM device, such as label operation,
>>>>     health check etc and hotplug support.
>>>
>>> Sounds like the control domain (dom0) would be in charge of that.
>>
>> Yup. Dom0 is a better place to handle it.
>>
>>>>
>>>> c) Resource management
>>>>     NVDIMM resource management challenged as:
>>>>     1) PMEM is huge and it is little slower access than RAM so it is not suitable
>>>>        to manage it as page struct (i think it is not a big problem in Xen
>>>>        hypervisor?)
>>>>     2) need to partition it to it be used in multiple VMs.
>>>>     3) need to support PBLK and partition it in the future.
>>>
>>> That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor.
>>
>> Sure, so let dom0 handle this is better, we are on the same page. :)
>>
>>>>
>>>> d) management tools support
>>>>     S.M.A.R.T? error detection and recovering?
>>>>
>>>> c) hotplug support
>>>
>>> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
>>> to scan. That would require the hypervisor also reading this for it to
>>> update it's data-structures.
>>
>> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
>> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
>> the better place handing this case too.
>
> That one is a bit difficult. Both the OS and the hypervisor would need to know about
> this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> the hypervisor needs to be told so it can slurp it up.

Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
handle all the things like native. If it cannot, dom0 can interpret the ACPI, fetch
the irq info out and tell the hypervisor to pass the irq to dom0 - is that doable?

>
> However I don't know if the hypervisor needs to know all the details of an
> NVDIMM - or just the starting and ending ranges so that when an guest is created
> and the VT-d is constructed - it can be assured that the ranges are valid.
>
> I am not an expert on the P2M code - but I think that would need to be looked
> at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.

We had better not use "E820" as it lacks some advantages of ACPI, such as NUMA, hotplug,
and label support (namespaces)...

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:29                 ` Xiao Guangrong
  2016-01-20 15:47                   ` Konrad Rzeszutek Wilk
@ 2016-01-20 17:07                   ` Jan Beulich
  2016-01-20 17:17                     ` Xiao Guangrong
  1 sibling, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-20 17:07 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser

>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>> To answer this I need to have my understanding of the partitioning
>> being done by firmware confirmed: If that's the case, then "normal"
>> means the part that doesn't get exposed as a block device (SSD).
>> In any event there's no correlation to guest exposure here.
> 
> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
> by OS.
> 
> Actually, there are lots of things we should take into account if we move
> the NVDIMM management to hypervisor:
> a) ACPI NFIT interpretation
>     A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>     base information of NVDIMM devices which includes PMEM info, PBLK
>     info, nvdimm device interleave, vendor info, etc. Let me explain it one
>     by one.
> 
>     PMEM and PBLK are two modes to access NVDIMM devices:
>     1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>        space so that CPU can r/w it directly.
>     2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>        only offers two windows which are mapped to CPU's address space, the data
>        window and access window, so that CPU can use these two windows to access
>        the whole NVDIMM device.

You fail to mention PBLK. The question above really was about what
entity controls which of the two modes get used (and perhaps for
which parts of the overall NVDIMM).

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 17:07                   ` Jan Beulich
@ 2016-01-20 17:17                     ` Xiao Guangrong
  2016-01-21  8:18                       ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-20 17:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser



On 01/21/2016 01:07 AM, Jan Beulich wrote:
>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>> To answer this I need to have my understanding of the partitioning
>>> being done by firmware confirmed: If that's the case, then "normal"
>>> means the part that doesn't get exposed as a block device (SSD).
>>> In any event there's no correlation to guest exposure here.
>>
>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>> by OS.
>>
>> Actually, there are lots of things we should take into account if we move
>> the NVDIMM management to hypervisor:
>> a) ACPI NFIT interpretation
>>      A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>      base information of NVDIMM devices which includes PMEM info, PBLK
>>      info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>      by one.
>>
>>      PMEM and PBLK are two modes to access NVDIMM devices:
>>      1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>         space so that CPU can r/w it directly.
>>      2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>         only offers two windows which are mapped to CPU's address space, the data
>>         window and access window, so that CPU can use these two windows to access
>>         the whole NVDIMM device.
>
> You fail to mention PBLK. The question above really was about what

The 2) is PBLK.

> entity controls which of the two modes get used (and perhaps for
> which parts of the overall NVDIMM).

So I think the "normal" you mentioned is about PMEM. :)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 16:55                         ` Xiao Guangrong
@ 2016-01-20 17:18                           ` Konrad Rzeszutek Wilk
  2016-01-20 17:23                             ` Xiao Guangrong
  2016-01-21  3:12                             ` Haozhong Zhang
  0 siblings, 2 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 17:18 UTC (permalink / raw)
  To: Xiao Guangrong, feng.wu
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Jun Nakajima, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Keir Fraser

> >>>>c) hotplug support
> >>>
> >>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> >>>to scan. That would require the hypervisor also reading this for it to
> >>>update it's data-structures.
> >>
> >>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> >>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> >>the better place handing this case too.
> >
> >That one is a bit difficult. Both the OS and the hypervisor would need to know about
> >this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> >the hypervisor needs to be told so it can slurp it up.
> 
> Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0

Yes of course it can.
> handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
> the irq info out and tell hypervior to pass the irq to dom0, it is doable?
> 
> >
> >However I don't know if the hypervisor needs to know all the details of an
> >NVDIMM - or just the starting and ending ranges so that when an guest is created
> >and the VT-d is constructed - it can be assured that the ranges are valid.
> >
> >I am not an expert on the P2M code - but I think that would need to be looked
> >at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
> 
> We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
> lable support (namespace)...

<hand-waves> I don't know what QEMU does for guests? I naively assumed it would
create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
the _DSM).

Either way, what I think you need to investigate is what is necessary for the
Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address of
the NVDIMM. Based on that you will know what kind of exposure the hypervisor
needs to the _FIT and NFIT tables.

(Adding Feng Wu, the VT-d maintainer).

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 17:18                           ` Konrad Rzeszutek Wilk
@ 2016-01-20 17:23                             ` Xiao Guangrong
  2016-01-20 17:48                               ` Konrad Rzeszutek Wilk
  2016-01-21  3:12                             ` Haozhong Zhang
  1 sibling, 1 reply; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-20 17:23 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, feng.wu
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Jun Nakajima, Keir Fraser



On 01/21/2016 01:18 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> c) hotplug support
>>>>>
>>>>> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
>>>>> to scan. That would require the hypervisor also reading this for it to
>>>>> update it's data-structures.
>>>>
>>>> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
>>>> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
>>>> the better place handing this case too.
>>>
>>> That one is a bit difficult. Both the OS and the hypervisor would need to know about
>>> this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
>>> the hypervisor needs to be told so it can slurp it up.
>>
>> Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
>
> Yes of course it can.
>> handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
>> the irq info out and tell hypervior to pass the irq to dom0, it is doable?
>>
>>>
>>> However I don't know if the hypervisor needs to know all the details of an
>>> NVDIMM - or just the starting and ending ranges so that when an guest is created
>>> and the VT-d is constructed - it can be assured that the ranges are valid.
>>>
>>> I am not an expert on the P2M code - but I think that would need to be looked
>>> at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
>>
>> We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
>> lable support (namespace)...
>
> <hand-waves> I don't know what QEMU does for guests? I naively assumed it would
> create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
> the _DSM).

Ah, ACPI eliminates this E820 entry.

>
> Either way what I think you need to investigate is what is neccessary for the
> Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
> the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
> needs to the _FIT and NFIT tables.
>

Interesting. I did not consider using NVDIMM as a DMA target. Do you have a use case for
this kind of NVDIMM usage?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 17:23                             ` Xiao Guangrong
@ 2016-01-20 17:48                               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-20 17:48 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, feng.wu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Jun Nakajima, Wei Liu, Keir Fraser

On Thu, Jan 21, 2016 at 01:23:31AM +0800, Xiao Guangrong wrote:
> 
> 
> On 01/21/2016 01:18 AM, Konrad Rzeszutek Wilk wrote:
> >>>>>>c) hotplug support
> >>>>>
> >>>>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> >>>>>to scan. That would require the hypervisor also reading this for it to
> >>>>>update it's data-structures.
> >>>>
> >>>>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> >>>>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> >>>>the better place handing this case too.
> >>>
> >>>That one is a bit difficult. Both the OS and the hypervisor would need to know about
> >>>this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> >>>the hypervisor needs to be told so it can slurp it up.
> >>
> >>Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
> >
> >Yes of course it can.
> >>handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
> >>the irq info out and tell hypervior to pass the irq to dom0, it is doable?
> >>
> >>>
> >>>However I don't know if the hypervisor needs to know all the details of an
> >>>NVDIMM - or just the starting and ending ranges so that when an guest is created
> >>>and the VT-d is constructed - it can be assured that the ranges are valid.
> >>>
> >>>I am not an expert on the P2M code - but I think that would need to be looked
> >>>at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
> >>
> >>We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
> >>lable support (namespace)...
> >
> ><hand-waves> I don't know what QEMU does for guests? I naively assumed it would
> >create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
> >the _DSM).
> 
> Ah, ACPI eliminates this E820 entry.
> 
> >
> >Either way what I think you need to investigate is what is neccessary for the
> >Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
> >the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
> >needs to the _FIT and NFIT tables.
> >
> 
> Interesting. I did not consider using NVDIMM as DMA. Do you have usecase for this
> kind of NVDIMM usage?

An easy one is an iSCSI target. You could have an SR-IOV NIC on a host with TCM
enabled (CONFIG_TCM_FILEIO or CONFIG_TCM_IBLOCK). Mount a file on the /dev/pmem0
filesystem (a DAX-enabled FS) and export it as an iSCSI LUN. The traffic would go over the SR-IOV NIC.

The DMA transactions would be SR-IOV NIC <-> NVDIMM.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:05                         ` Stefano Stabellini
@ 2016-01-20 18:14                           ` Andrew Cooper
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Cooper @ 2016-01-20 18:14 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Jun Nakajima, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On 20/01/16 15:05, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Andrew Cooper wrote:
>> On 20/01/16 14:29, Stefano Stabellini wrote:
>>> On Wed, 20 Jan 2016, Andrew Cooper wrote:
>>>>
>>> I wouldn't encourage the introduction of anything else that requires
>>> root privileges in QEMU. With QEMU running as non-root by default in
>>> 4.7, the feature will not be available unless users explicitly ask to
>>> run QEMU as root (which they shouldn't really).
>> This isn't how design works.
>>
>> First, design a feature in an architecturally correct way, and then
>> design an security policy to fit.
>>
>> We should not stunt design based on an existing implementation.  In
>> particular, if design shows that being a root only feature is the only
>> sane way of doing this, it should be a root only feature.  (I hope this
>> is not the case, but it shouldn't cloud the judgement of a design).
> I would argue that security is an integral part of the architecture and
> should not be retrofitted into it.

There is no retrofitting - it is all part of the same overall design
before coding starts happening.

>
> Is it really a good design if the only sane way to implement it is
> making it a root-only feature? I think not.

Then you have missed the point.

If you fail at architecting the feature in the first place, someone else
is going to have to come along and reimplement it properly, then provide
some form of compatibility with the old one.

Security is an important consideration in the design; I do not wish to
understate that.  However, if the only way for a feature to be
architected properly is for the feature to be a root-only feature, then
it should be a root-only feature.

>  Designing security policies
> for pieces of software that don't have the infrastructure for them is
> costly and that cost should be accounted as part of the overall cost of
> the solution rather than added to it in a second stage.

That cost is far better spent designing it properly in the first place,
rather than having to come along and reimplement a v2 because v1 was broken.

>
>
>> (note, both before implement happens).
> That is ideal but realistically in many cases nobody is able to produce
> a design before the implementation happens.

It is perfectly easy.  This is the difference between software
engineering and software hacking.

There has been a lot of positive feedback from on-list design
documents.  It is a trend which needs to continue.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 17:18                           ` Konrad Rzeszutek Wilk
  2016-01-20 17:23                             ` Xiao Guangrong
@ 2016-01-21  3:12                             ` Haozhong Zhang
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-21  3:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Kevin Tian, feng.wu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

On 01/20/16 12:18, Konrad Rzeszutek Wilk wrote:
> > >>>>c) hotplug support
> > >>>
> > >>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> > >>>to scan. That would require the hypervisor also reading this for it to
> > >>>update it's data-structures.
> > >>
> > >>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> > >>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> > >>the better place handing this case too.
> > >
> > >That one is a bit difficult. Both the OS and the hypervisor would need to know about
> > >this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> > >the hypervisor needs to be told so it can slurp it up.
> > 
> > Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
> 
> Yes of course it can.
> > handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
> > the irq info out and tell hypervior to pass the irq to dom0, it is doable?
> > 
> > >
> > >However I don't know if the hypervisor needs to know all the details of an
> > >NVDIMM - or just the starting and ending ranges so that when an guest is created
> > >and the VT-d is constructed - it can be assured that the ranges are valid.
> > >
> > >I am not an expert on the P2M code - but I think that would need to be looked
> > >at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
> > 
> > We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
> > lable support (namespace)...
> 
> <hand-waves> I don't know what QEMU does for guests? I naively assumed it would
> create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
> the _DSM).
>

ACPI 6 defines E820 type 7 for pmem (see Table 15-312 in Section 15),
legacy firmware may use the non-standard type 12 (and even older firmware
may use type 6, but Linux does not consider type 6 any more), and a
hot-plugged NVDIMM may not appear in E820 at all. I still think it's better to
let dom0 Linux, which already has the necessary drivers, handle all these device
probing tasks.
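
(For concreteness, the types referred to here, roughly as the Linux tree names
them - a sketch, with values taken from ACPI 6.0 Table 15-312 and the legacy
pre-standard convention:)

/* E820 address-range types relevant to persistent memory (sketch). */
enum {
    E820_PMEM = 7,   /* ACPI 6.0 "Address Range Persistent Memory" */
    E820_PRAM = 12,  /* legacy, non-standard pre-ACPI-6 NVDIMM range */
    /* type 6 was used by some very early firmware but is ignored now */
};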

> Either way what I think you need to investigate is what is neccessary for the
> Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
> the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
> needs to the _FIT and NFIT tables.
>
> (Adding Feng Wu, the VT-d maintainer).

I haven't considered VT-d at all. From your example in another reply,
it looks like the VT-d code needs to be aware of the address space
ranges of the NVDIMM, otherwise that example would not work. If so, maybe
we can let the dom0 Linux kernel report the address space ranges of
detected NVDIMM devices to the Xen hypervisor. Anyway, I'll investigate
this issue.

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 15:41                               ` Konrad Rzeszutek Wilk
  2016-01-20 15:54                                 ` Haozhong Zhang
@ 2016-01-21  3:35                                 ` Bob Liu
  1 sibling, 0 replies; 88+ messages in thread
From: Bob Liu @ 2016-01-21  3:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich, Jun Nakajima,
	Xiao Guangrong, Keir Fraser


On 01/20/2016 11:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> Neither of these are sufficient however.  That gets Qemu a mapping of
>>>>> the NVDIMM, not the guest.  Something, one way or another, has to turn
>>>>> this into appropriate add-to-phymap hypercalls.
>>>>>
>>>>
>>>> Yes, those hypercalls are what I'm going to add.
>>>
>>> Why?
>>>
>>> What you need (in a rough hand-wave way) is to:
>>>  - mount /dev/pmem0
>>>  - mmap the file on /dev/pmem0 FS
>>>  - walk the VMA for the file - extract the MFNs (machine frame numbers)
>>

If I understand right, in this case the MFNs come from the block layout of the DAX file?
If we find all the file blocks, then we get all the MFNs.

>> Can this step be done by QEMU? Or does linux kernel provide some
>> approach for the userspace to do the translation?
> 

The ioctl(fd, FIBMAP, &block) may help, which can get the LBAs that a given file occupies. 

-Bob
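
(A minimal sketch of the FIBMAP lookup mentioned above, for reference. FIBMAP
needs CAP_SYS_RAWIO, returns block numbers in filesystem-block units, and newer
code would more likely use the FIEMAP ioctl; whether either is the right way to
hand a DAX file's extents to Xen is still an open question in this thread.)

/* Hedged sketch: ask the filesystem which physical block backs logical
 * block 0 of an already-allocated file. */
#include <fcntl.h>
#include <linux/fs.h>     /* FIBMAP */
#include <stdio.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
    int block = 0;        /* in: logical block index; out: physical block */
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;
    if (ioctl(fd, FIBMAP, &block) < 0) {
        perror("FIBMAP");
        return 1;
    }
    printf("logical block 0 of %s -> physical block %d\n", argv[1], block);
    return 0;
}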

> I don't know. I would think no - as you wouldn't want the userspace
> application to figure out the physical frames from the virtual
> address (unless they are root). But then if you look in
> /proc/<pid>/maps and /proc/<pid>/smaps there are some data there.
> 
> Hm, /proc/<pid>/pagemap has something interesting
> 
> See pagemap_read function. That looks to be doing it?
> 
>>
>> Haozhong
>>
>>>  - feed those frame numbers to xc_memory_mapping hypercall. The
>>>    guest pfns would be contiguous.
>>>    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
>>>    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
>>>
>>>    However the MFNs may be discontiguous as the NVDIMM could be a
>>>    1TB device - and the 8GB file is scattered all over.
>>>
>>> I believe that is all you would need to do?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-20 17:17                     ` Xiao Guangrong
@ 2016-01-21  8:18                       ` Jan Beulich
  2016-01-21  8:25                         ` Xiao Guangrong
  0 siblings, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-21  8:18 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser

>>> On 20.01.16 at 18:17, <guangrong.xiao@linux.intel.com> wrote:

> 
> On 01/21/2016 01:07 AM, Jan Beulich wrote:
>>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
>>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>> To answer this I need to have my understanding of the partitioning
>>>> being done by firmware confirmed: If that's the case, then "normal"
>>>> means the part that doesn't get exposed as a block device (SSD).
>>>> In any event there's no correlation to guest exposure here.
>>>
>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>>> by OS.
>>>
>>> Actually, there are lots of things we should take into account if we move
>>> the NVDIMM management to hypervisor:
>>> a) ACPI NFIT interpretation
>>>      A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>>      base information of NVDIMM devices which includes PMEM info, PBLK
>>>      info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>>      by one.
>>>
>>>      PMEM and PBLK are two modes to access NVDIMM devices:
>>>      1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>>         space so that CPU can r/w it directly.
>>>      2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>>         only offers two windows which are mapped to CPU's address space, the data
>>>         window and access window, so that CPU can use these two windows to access
>>>         the whole NVDIMM device.
>>
>> You fail to mention PBLK. The question above really was about what
> 
> The 2) is PBLK.
> 
>> entity controls which of the two modes get used (and perhaps for
>> which parts of the overall NVDIMM).
> 
> So i think the "normal" you mentioned is about PMEM. :)

Yes. But then - other than what you said above - it still looks to me as
if the split between PMEM and PBLK is arranged for by firmware?

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21  8:18                       ` Jan Beulich
@ 2016-01-21  8:25                         ` Xiao Guangrong
  2016-01-21  8:53                           ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-21  8:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser



On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>>> On 20.01.16 at 18:17, <guangrong.xiao@linux.intel.com> wrote:
>
>>
>> On 01/21/2016 01:07 AM, Jan Beulich wrote:
>>>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
>>>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>>> To answer this I need to have my understanding of the partitioning
>>>>> being done by firmware confirmed: If that's the case, then "normal"
>>>>> means the part that doesn't get exposed as a block device (SSD).
>>>>> In any event there's no correlation to guest exposure here.
>>>>
>>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>>>> by OS.
>>>>
>>>> Actually, there are lots of things we should take into account if we move
>>>> the NVDIMM management to hypervisor:
>>>> a) ACPI NFIT interpretation
>>>>       A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>>>       base information of NVDIMM devices which includes PMEM info, PBLK
>>>>       info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>>>       by one.
>>>>
>>>>       PMEM and PBLK are two modes to access NVDIMM devices:
>>>>       1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>>>          space so that CPU can r/w it directly.
>>>>       2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>>>          only offers two windows which are mapped to CPU's address space, the data
>>>>          window and access window, so that CPU can use these two windows to access
>>>>          the whole NVDIMM device.
>>>
>>> You fail to mention PBLK. The question above really was about what
>>
>> The 2) is PBLK.
>>
>>> entity controls which of the two modes get used (and perhaps for
>>> which parts of the overall NVDIMM).
>>
>> So i think the "normal" you mentioned is about PMEM. :)
>
> Yes. But then - other than you said above - it still looks to me as
> if the split between PMEM and PBLK is arranged for by firmware?

Yes. But the OS/hypervisor is not expected to dynamically change this configuration (re-split);
i.e., from the PoV of the OS/hypervisor, it is static.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21  8:25                         ` Xiao Guangrong
@ 2016-01-21  8:53                           ` Jan Beulich
  2016-01-21  9:10                             ` Xiao Guangrong
  0 siblings, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-21  8:53 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser

>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>> Yes. But then - other than you said above - it still looks to me as
>> if the split between PMEM and PBLK is arranged for by firmware?
> 
> Yes. But OS/Hypervisor is not excepted to dynamically change its configure 
> (re-split),
> i,e, for PoV of OS/Hypervisor, it is static.

Exactly, that has been my understanding. And hence the PMEM part
could be under the hypervisor's control, while the PBLK part could be
Dom0's responsibility.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21  8:53                           ` Jan Beulich
@ 2016-01-21  9:10                             ` Xiao Guangrong
  2016-01-21  9:29                               ` Andrew Cooper
  2016-01-21 10:25                               ` Jan Beulich
  0 siblings, 2 replies; 88+ messages in thread
From: Xiao Guangrong @ 2016-01-21  9:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser



On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>> Yes. But then - other than you said above - it still looks to me as
>>> if the split between PMEM and PBLK is arranged for by firmware?
>>
>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
>> (re-split),
>> i,e, for PoV of OS/Hypervisor, it is static.
>
> Exactly, that has been my understanding. And hence the PMEM part
> could be under the hypervisor's control, while the PBLK part could be
> Dom0's responsibility.
>

I am not sure if I have understood your point. Is your suggestion that we
leave PMEM to the hypervisor and all other parts (PBLK and _DSM handling) to
Dom0? If yes, we would need to:
a) handle hotplug in the hypervisor (new PMEM add/remove), which requires the
    hypervisor to interpret the ACPI SSDT/DSDT.
b) filter out the _DSMs that control PMEM and handle them in the hypervisor.
c) have the hypervisor manage the PMEM resource pool and partition it among
    multiple VMs.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21  9:10                             ` Xiao Guangrong
@ 2016-01-21  9:29                               ` Andrew Cooper
  2016-01-21 10:26                                 ` Jan Beulich
  2016-01-21 10:25                               ` Jan Beulich
  1 sibling, 1 reply; 88+ messages in thread
From: Andrew Cooper @ 2016-01-21  9:29 UTC (permalink / raw)
  To: Xiao Guangrong, Jan Beulich
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Ian Jackson, xen-devel, Jun Nakajima,
	Keir Fraser

On 21/01/16 09:10, Xiao Guangrong wrote:
>
>
> On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>>> Yes. But then - other than you said above - it still looks to me as
>>>> if the split between PMEM and PBLK is arranged for by firmware?
>>>
>>> Yes. But OS/Hypervisor is not excepted to dynamically change its
>>> configure
>>> (re-split),
>>> i,e, for PoV of OS/Hypervisor, it is static.
>>
>> Exactly, that has been my understanding. And hence the PMEM part
>> could be under the hypervisor's control, while the PBLK part could be
>> Dom0's responsibility.
>>
>
> I am not sure if i have understood your point. What your suggestion is
> that
> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> Dom0? If yes, we should:
> a) handle hotplug in hypervisor (new PMEM add/remove) that causes
> hyperivsor
>    interpret ACPI SSDT/DSDT.
> b) some _DSMs control PMEM so you should filter out these kind of
> _DSMs and
>    handle them in hypervisor.
> c) hypervisor should mange PMEM resource pool and partition it to
> multiple
>    VMs.

It is not possible for Xen to handle ACPI such as this.

There can only be one OSPM on a system, and 9/10ths of the functionality
needing it already lives in Dom0.

The only rational course of action is for Xen to treat both PBLK and
PMEM as "devices" and leave them in Dom0's hands.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21  9:10                             ` Xiao Guangrong
  2016-01-21  9:29                               ` Andrew Cooper
@ 2016-01-21 10:25                               ` Jan Beulich
  2016-01-21 14:01                                 ` Haozhong Zhang
  1 sibling, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-21 10:25 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Keir Fraser

>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>>> Yes. But then - other than you said above - it still looks to me as
>>>> if the split between PMEM and PBLK is arranged for by firmware?
>>>
>>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
>>> (re-split),
>>> i,e, for PoV of OS/Hypervisor, it is static.
>>
>> Exactly, that has been my understanding. And hence the PMEM part
>> could be under the hypervisor's control, while the PBLK part could be
>> Dom0's responsibility.
>>
> 
> I am not sure if i have understood your point. What your suggestion is that
> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> Dom0? If yes, we should:
> a) handle hotplug in hypervisor (new PMEM add/remove) that causes hyperivsor
>     interpret ACPI SSDT/DSDT.

Why would this be different from ordinary memory hotplug, where
Dom0 deals with the ACPI CA interaction, notifying Xen about the
added memory?

> b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
>     handle them in hypervisor.

Not if (see above) following the model we currently have in place.

> c) hypervisor should mange PMEM resource pool and partition it to multiple
>     VMs.

Yes.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21  9:29                               ` Andrew Cooper
@ 2016-01-21 10:26                                 ` Jan Beulich
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Beulich @ 2016-01-21 10:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Ian Jackson, xen-devel, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

>>> On 21.01.16 at 10:29, <andrew.cooper3@citrix.com> wrote:
> On 21/01/16 09:10, Xiao Guangrong wrote:
>> I am not sure if i have understood your point. What your suggestion is
>> that
>> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
>> Dom0? If yes, we should:
>> a) handle hotplug in hypervisor (new PMEM add/remove) that causes
>> hyperivsor
>>    interpret ACPI SSDT/DSDT.
>> b) some _DSMs control PMEM so you should filter out these kind of
>> _DSMs and
>>    handle them in hypervisor.
>> c) hypervisor should mange PMEM resource pool and partition it to
>> multiple
>>    VMs.
> 
> It is not possible for Xen to handle ACPI such as this.
> 
> There can only be one OSPM on a system, and 9/10ths of the functionality
> needing it already lives in Dom0.
> 
> The only rational course of action is for Xen to treat both PBLK and
> PMEM as "devices" and leave them in Dom0's hands.

See my other reply: Why would this be different from "ordinary"
memory hotplug?

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21 10:25                               ` Jan Beulich
@ 2016-01-21 14:01                                 ` Haozhong Zhang
  2016-01-21 14:52                                   ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-21 14:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Xiao Guangrong, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima, Wei Liu,
	Keir Fraser

On 01/21/16 03:25, Jan Beulich wrote:
> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> > On 01/21/2016 04:53 PM, Jan Beulich wrote:
> >>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
> >>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
> >>>> Yes. But then - other than you said above - it still looks to me as
> >>>> if the split between PMEM and PBLK is arranged for by firmware?
> >>>
> >>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
> >>> (re-split),
> >>> i,e, for PoV of OS/Hypervisor, it is static.
> >>
> >> Exactly, that has been my understanding. And hence the PMEM part
> >> could be under the hypervisor's control, while the PBLK part could be
> >> Dom0's responsibility.
> >>
> > 
> > I am not sure if i have understood your point. What your suggestion is that
> > leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> > Dom0? If yes, we should:
> > a) handle hotplug in hypervisor (new PMEM add/remove) that causes hyperivsor
> >     interpret ACPI SSDT/DSDT.
> 
> Why would this be different from ordinary memory hotplug, where
> Dom0 deals with the ACPI CA interaction, notifying Xen about the
> added memory?
>

The process of NVDIMM hotplug is similar to ordinary memory hotplug, so it
seems possible to support it in the Xen hypervisor in the same way.

> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
> >     handle them in hypervisor.
> 
> Not if (see above) following the model we currently have in place.
>

You mean letting Dom0 Linux evaluate those _DSMs and interact with the
hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?
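
For reference, roughly what that existing notification looks like from Dom0's
side (a minimal sketch: XENPF_mem_hotadd and its spfn/epfn/pxm fields follow
the public platform-op interface, while do_platform_op() below only stands in
for whatever hypercall wrapper the Dom0 kernel actually uses):

    /* Sketch: Dom0 tells Xen about a hot-added physical memory range.   */
    #include <stdint.h>

    #define XENPF_mem_hotadd 59      /* per xen/include/public/platform.h */

    struct xenpf_mem_hotadd {
        uint64_t spfn;               /* first pfn of the new range        */
        uint64_t epfn;               /* end pfn (exclusive)               */
        uint32_t pxm;                /* proximity domain from ACPI SRAT   */
        uint32_t flags;
    };

    int do_platform_op(uint32_t cmd, void *arg);   /* placeholder wrapper */

    static int notify_xen_mem_hotadd(uint64_t spfn, uint64_t epfn, uint32_t pxm)
    {
        struct xenpf_mem_hotadd op = {
            .spfn = spfn, .epfn = epfn, .pxm = pxm, .flags = 0,
        };
        return do_platform_op(XENPF_mem_hotadd, &op);
    }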

> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> >     VMs.
> 
> Yes.
>

But I still do not quite understand this part: why must pmem resource
management and partitioning be done in the hypervisor?

I mean, suppose we allow the following steps of operations (for example):
(1) partition pmem in Dom0;
(2) get the address and size of each partition (part_addr, part_size);
(3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, gpfn)
    to map a partition to the address gpfn in dom d (see the sketch below).
Only the last step requires the hypervisor. Would anything be wrong if we
allowed the above operations?
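
To make step (3) concrete, a minimal sketch of what the toolstack-side call
could look like; nvdimm_memory_mapping() and everything else here are
hypothetical names, only meant to show the shape of the interface:

    /* Hypothetical flow for steps (1)-(3): the partitioning itself happens
     * in Dom0 (namespaces, DAX files, ...); only the final mapping request
     * involves the hypervisor.  None of these names exist today. */
    #include <stdint.h>

    /* hypothetical hypercall wrapper: map the host pmem range
     * [part_addr, part_addr + part_size) at guest frame gpfn of domain d */
    int nvdimm_memory_mapping(uint32_t d, uint64_t part_addr,
                              uint64_t part_size, uint64_t gpfn);

    static int assign_pmem_partition(uint32_t d, uint64_t part_addr,
                                     uint64_t part_size, uint64_t gpfn)
    {
        const uint64_t page_size = 4096;

        /* the mapping is done at page granularity */
        if ((part_addr | part_size) & (page_size - 1))
            return -1;

        return nvdimm_memory_mapping(d, part_addr, part_size, gpfn);
    }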

Ha

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21 14:01                                 ` Haozhong Zhang
@ 2016-01-21 14:52                                   ` Jan Beulich
  2016-01-22  2:43                                     ` Haozhong Zhang
  2016-01-26 11:44                                     ` George Dunlap
  0 siblings, 2 replies; 88+ messages in thread
From: Jan Beulich @ 2016-01-21 14:52 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> On 01/21/16 03:25, Jan Beulich wrote:
>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
>> >     handle them in hypervisor.
>> 
>> Not if (see above) following the model we currently have in place.
>>
> 
> You mean let dom0 linux evaluates those _DSMs and interact with
> hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?

Yes.

>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
>> >     VMs.
>> 
>> Yes.
>>
> 
> But I Still do not quite understand this part: why must pmem resource
> management and partition be done in hypervisor?

Because that's where memory management belongs. And PMEM,
other than PBLK, is just another form of RAM.

> I mean if we allow the following steps of operations (for example)
> (1) partition pmem in dom 0
> (2) get address and size of each partition (part_addr, part_size)
> (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, 
> gpfn) to
>     map a partition to the address gpfn in dom d.
> Only the last step requires hypervisor. Would anything be wrong if we
> allow above operations?

The main issue is that this would imo be a layering violation. I'm
sure it can be made work, but that doesn't mean that's the way
it ought to work.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21 14:52                                   ` Jan Beulich
@ 2016-01-22  2:43                                     ` Haozhong Zhang
  2016-01-26 11:44                                     ` George Dunlap
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-22  2:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	Andrew Cooper, Ian Jackson, xen-devel, Jun Nakajima,
	Xiao Guangrong, Keir Fraser

On 01/21/16 07:52, Jan Beulich wrote:
> >>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> > On 01/21/16 03:25, Jan Beulich wrote:
> >> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> >> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
> >> >     handle them in hypervisor.
> >> 
> >> Not if (see above) following the model we currently have in place.
> >>
> > 
> > You mean let dom0 linux evaluates those _DSMs and interact with
> > hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?
> 
> Yes.
> 
> >> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> >> >     VMs.
> >> 
> >> Yes.
> >>
> > 
> > But I Still do not quite understand this part: why must pmem resource
> > management and partition be done in hypervisor?
> 
> Because that's where memory management belongs. And PMEM,
> other than PBLK, is just another form of RAM.
> 
> > I mean if we allow the following steps of operations (for example)
> > (1) partition pmem in dom 0
> > (2) get address and size of each partition (part_addr, part_size)
> > (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, 
> > gpfn) to
> >     map a partition to the address gpfn in dom d.
> > Only the last step requires hypervisor. Would anything be wrong if we
> > allow above operations?
> 
> The main issue is that this would imo be a layering violation. I'm
> sure it can be made work, but that doesn't mean that's the way
> it ought to work.
> 
> Jan
> 

OK, then it makes sense to put them in the hypervisor. I'll think about
this and note it in the design document.

Thanks,
Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-21 14:52                                   ` Jan Beulich
  2016-01-22  2:43                                     ` Haozhong Zhang
@ 2016-01-26 11:44                                     ` George Dunlap
  2016-01-26 12:44                                       ` Jan Beulich
  1 sibling, 1 reply; 88+ messages in thread
From: George Dunlap @ 2016-01-26 11:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>> On 01/21/16 03:25, Jan Beulich wrote:
>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
>>> >     handle them in hypervisor.
>>>
>>> Not if (see above) following the model we currently have in place.
>>>
>>
>> You mean let dom0 linux evaluates those _DSMs and interact with
>> hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?
>
> Yes.
>
>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
>>> >     VMs.
>>>
>>> Yes.
>>>
>>
>> But I Still do not quite understand this part: why must pmem resource
>> management and partition be done in hypervisor?
>
> Because that's where memory management belongs. And PMEM,
> other than PBLK, is just another form of RAM.

I haven't looked more deeply into the details of this, but this
argument doesn't seem right to me.

Normal RAM in Xen is what might be called "fungible" -- at boot, all
RAM is zeroed, and it basically doesn't matter at all what RAM is
given to what guest.  (There are restrictions of course: lowmem for
DMA, contiguous superpages, &c; but within those groups, it doesn't
matter *which* bit of lowmem you get, as long as you get enough to do
your job.)  If you reboot your guest or hand RAM back to the
hypervisor, you assume that everything in it will disappear.  When you
ask for RAM, you can request some parameters that it will have
(lowmem, on a specific node, &c), but you can't request a specific
page that you had before.

This is not the case for PMEM.  The whole point of PMEM (correct me if
I'm wrong) is to be used for long-term storage that survives over
reboot.  It matters very much that a guest be given the same PRAM
after the host is rebooted that it was given before.  It doesn't make
any sense to manage it the way Xen currently manages RAM (i.e., that
you request a page and get whatever Xen happens to give you).

So if Xen is going to use PMEM, it will have to invent an entirely new
interface for guests, and it will have to keep track of those
resources across host reboots.  In other words, it will have to
duplicate all the work that Linux already does.  What do we gain from
that duplication?  Why not just leverage what's already implemented in
dom0?

>> I mean if we allow the following steps of operations (for example)
>> (1) partition pmem in dom 0
>> (2) get address and size of each partition (part_addr, part_size)
>> (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size,
>> gpfn) to
>>     map a partition to the address gpfn in dom d.
>> Only the last step requires hypervisor. Would anything be wrong if we
>> allow above operations?
>
> The main issue is that this would imo be a layering violation. I'm
> sure it can be made work, but that doesn't mean that's the way
> it ought to work.

Jan, from a toolstack <-> Xen perspective, I'm not sure what alternative
there is to the interface above.  Won't the toolstack have to
1) figure out what nvdimm regions there are and 2) tell Xen how and
where to assign them to the guest no matter what we do?  And if we
want to assign arbitrary regions to arbitrary guests, then (part_addr,
part_size) and (gpfn) are going to be necessary bits of information.
The only difference would be whether part_addr is the machine address
or some abstracted address space (possibly starting at 0).

What does your ideal toolstack <-> Xen interface look like?

 -George

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 11:44                                     ` George Dunlap
@ 2016-01-26 12:44                                       ` Jan Beulich
  2016-01-26 12:54                                         ` Juergen Gross
                                                           ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Jan Beulich @ 2016-01-26 12:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>>> On 01/21/16 03:25, Jan Beulich wrote:
>>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
>>>> >     VMs.
>>>>
>>>> Yes.
>>>>
>>>
>>> But I Still do not quite understand this part: why must pmem resource
>>> management and partition be done in hypervisor?
>>
>> Because that's where memory management belongs. And PMEM,
>> other than PBLK, is just another form of RAM.
> 
> I haven't looked more deeply into the details of this, but this
> argument doesn't seem right to me.
> 
> Normal RAM in Xen is what might be called "fungible" -- at boot, all
> RAM is zeroed, and it basically doesn't matter at all what RAM is
> given to what guest.  (There are restrictions of course: lowmem for
> DMA, contiguous superpages, &c; but within those groups, it doesn't
> matter *which* bit of lowmem you get, as long as you get enough to do
> your job.)  If you reboot your guest or hand RAM back to the
> hypervisor, you assume that everything in it will disappear.  When you
> ask for RAM, you can request some parameters that it will have
> (lowmem, on a specific node, &c), but you can't request a specific
> page that you had before.
> 
> This is not the case for PMEM.  The whole point of PMEM (correct me if
> I'm wrong) is to be used for long-term storage that survives over
> reboot.  It matters very much that a guest be given the same PRAM
> after the host is rebooted that it was given before.  It doesn't make
> any sense to manage it the way Xen currently manages RAM (i.e., that
> you request a page and get whatever Xen happens to give you).

Interesting. This isn't the usage model I have been thinking about
so far. Having just gone back to the original 0/4 mail, I'm afraid
we're really left guessing, and you guessed differently than I did.
My understanding of the intentions of PMEM so far was that this
is a high-capacity, slower than DRAM but much faster than e.g.
swapping to disk alternative to normal RAM. I.e. the persistent
aspect of it wouldn't matter at all in this case (other than for PBLK,
obviously).

However, thinking through your usage model I have problems
seeing it work in a reasonable way even with virtualization left
aside: To my knowledge there's no established protocol on how
multiple parties (different versions of the same OS, or even
completely different OSes) would arbitrate using such memory
ranges. And even for a single OS it is, other than for disks (and
hence PBLK), not immediately clear how it would communicate
from one boot to another what information got stored where,
or how it would react to some or all of this storage having
disappeared (just like a disk which got removed, which - unless
it held the boot partition - would normally have pretty little
effect on the OS coming back up).

> So if Xen is going to use PMEM, it will have to invent an entirely new
> interface for guests, and it will have to keep track of those
> resources across host reboots.  In other words, it will have to
> duplicate all the work that Linux already does.  What do we gain from
> that duplication?  Why not just leverage what's already implemented in
> dom0?

Indeed if my guessing on the intentions was wrong, then the
picture completely changes (also for the points you've made
further down).

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 12:44                                       ` Jan Beulich
@ 2016-01-26 12:54                                         ` Juergen Gross
  2016-01-26 14:44                                           ` Konrad Rzeszutek Wilk
  2016-01-26 13:58                                         ` George Dunlap
  2016-01-26 15:30                                         ` Haozhong Zhang
  2 siblings, 1 reply; 88+ messages in thread
From: Juergen Gross @ 2016-01-26 12:54 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 26/01/16 13:44, Jan Beulich wrote:
>>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>>>> On 01/21/16 03:25, Jan Beulich wrote:
>>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple
>>>>>>     VMs.
>>>>>
>>>>> Yes.
>>>>>
>>>>
>>>> But I Still do not quite understand this part: why must pmem resource
>>>> management and partition be done in hypervisor?
>>>
>>> Because that's where memory management belongs. And PMEM,
>>> other than PBLK, is just another form of RAM.
>>
>> I haven't looked more deeply into the details of this, but this
>> argument doesn't seem right to me.
>>
>> Normal RAM in Xen is what might be called "fungible" -- at boot, all
>> RAM is zeroed, and it basically doesn't matter at all what RAM is
>> given to what guest.  (There are restrictions of course: lowmem for
>> DMA, contiguous superpages, &c; but within those groups, it doesn't
>> matter *which* bit of lowmem you get, as long as you get enough to do
>> your job.)  If you reboot your guest or hand RAM back to the
>> hypervisor, you assume that everything in it will disappear.  When you
>> ask for RAM, you can request some parameters that it will have
>> (lowmem, on a specific node, &c), but you can't request a specific
>> page that you had before.
>>
>> This is not the case for PMEM.  The whole point of PMEM (correct me if
>> I'm wrong) is to be used for long-term storage that survives over
>> reboot.  It matters very much that a guest be given the same PRAM
>> after the host is rebooted that it was given before.  It doesn't make
>> any sense to manage it the way Xen currently manages RAM (i.e., that
>> you request a page and get whatever Xen happens to give you).
> 
> Interesting. This isn't the usage model I have been thinking about
> so far. Having just gone back to the original 0/4 mail, I'm afraid
> we're really left guessing, and you guessed differently than I did.
> My understanding of the intentions of PMEM so far was that this
> is a high-capacity, slower than DRAM but much faster than e.g.
> swapping to disk alternative to normal RAM. I.e. the persistent
> aspect of it wouldn't matter at all in this case (other than for PBLK,
> obviously).
> 
> However, thinking through your usage model I have problems
> seeing it work in a reasonable way even with virtualization left
> aside: To my knowledge there's no established protocol on how
> multiple parties (different versions of the same OS, or even
> completely different OSes) would arbitrate using such memory
> ranges. And even for a single OS it is, other than for disks (and
> hence PBLK), not immediately clear how it would communicate
> from one boot to another what information got stored where,
> or how it would react to some or all of this storage having
> disappeared (just like a disk which got removed, which - unless
> it held the boot partition - would normally have pretty little
> effect on the OS coming back up).

Last year at Linux Plumbers Conference I attended a session dedicated
to NVDIMM support. I asked the very same question and the INTEL guy
there told me there is indeed something like a partition table meant
to describe the layout of the memory areas and their contents.

It would be nice to have a pointer to such information. Without anything
like this it might be rather difficult to find the best way to implement
NVDIMM support in Xen or any other product.


Juergen

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 12:44                                       ` Jan Beulich
  2016-01-26 12:54                                         ` Juergen Gross
@ 2016-01-26 13:58                                         ` George Dunlap
  2016-01-26 14:46                                           ` Konrad Rzeszutek Wilk
  2016-01-26 15:30                                         ` Haozhong Zhang
  2 siblings, 1 reply; 88+ messages in thread
From: George Dunlap @ 2016-01-26 13:58 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 26/01/16 12:44, Jan Beulich wrote:
>>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>>>> On 01/21/16 03:25, Jan Beulich wrote:
>>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple
>>>>>>     VMs.
>>>>>
>>>>> Yes.
>>>>>
>>>>
>>>> But I Still do not quite understand this part: why must pmem resource
>>>> management and partition be done in hypervisor?
>>>
>>> Because that's where memory management belongs. And PMEM,
>>> other than PBLK, is just another form of RAM.
>>
>> I haven't looked more deeply into the details of this, but this
>> argument doesn't seem right to me.
>>
>> Normal RAM in Xen is what might be called "fungible" -- at boot, all
>> RAM is zeroed, and it basically doesn't matter at all what RAM is
>> given to what guest.  (There are restrictions of course: lowmem for
>> DMA, contiguous superpages, &c; but within those groups, it doesn't
>> matter *which* bit of lowmem you get, as long as you get enough to do
>> your job.)  If you reboot your guest or hand RAM back to the
>> hypervisor, you assume that everything in it will disappear.  When you
>> ask for RAM, you can request some parameters that it will have
>> (lowmem, on a specific node, &c), but you can't request a specific
>> page that you had before.
>>
>> This is not the case for PMEM.  The whole point of PMEM (correct me if
>> I'm wrong) is to be used for long-term storage that survives over
>> reboot.  It matters very much that a guest be given the same PRAM
>> after the host is rebooted that it was given before.  It doesn't make
>> any sense to manage it the way Xen currently manages RAM (i.e., that
>> you request a page and get whatever Xen happens to give you).
> 
> Interesting. This isn't the usage model I have been thinking about
> so far. Having just gone back to the original 0/4 mail, I'm afraid
> we're really left guessing, and you guessed differently than I did.
> My understanding of the intentions of PMEM so far was that this
> is a high-capacity, slower than DRAM but much faster than e.g.
> swapping to disk alternative to normal RAM. I.e. the persistent
> aspect of it wouldn't matter at all in this case (other than for PBLK,
> obviously).

Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
then you're right -- it is just another form of RAM, that should be
treated no differently than say, lowmem: a fungible resource that can be
requested by setting a flag.

Haozhong?

 -George

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 12:54                                         ` Juergen Gross
@ 2016-01-26 14:44                                           ` Konrad Rzeszutek Wilk
  2016-01-26 15:37                                             ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 14:44 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Jun Nakajima, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

> Last year at Linux Plumbers Conference I attended a session dedicated
> to NVDIMM support. I asked the very same question and the INTEL guy
> there told me there is indeed something like a partition table meant
> to describe the layout of the memory areas and their contents.

It is described in detail at pmem.io; look at Documents, see
http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, in particular the Namespaces section.

Then I would recommend you read:
http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

followed by http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

And then for dessert:
https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt
which explains it in more technical terms.
> 
> It would be nice to have a pointer to such information. Without anything
> like this it might be rather difficult to find the best solution how to
> implement NVDIMM support in Xen or any other product.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 13:58                                         ` George Dunlap
@ 2016-01-26 14:46                                           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 14:46 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jun Nakajima, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong, Keir Fraser

On Tue, Jan 26, 2016 at 01:58:35PM +0000, George Dunlap wrote:
> On 26/01/16 12:44, Jan Beulich wrote:
> >>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> >> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> >>>> On 01/21/16 03:25, Jan Beulich wrote:
> >>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> >>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple
> >>>>>>     VMs.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>
> >>>> But I Still do not quite understand this part: why must pmem resource
> >>>> management and partition be done in hypervisor?
> >>>
> >>> Because that's where memory management belongs. And PMEM,
> >>> other than PBLK, is just another form of RAM.
> >>
> >> I haven't looked more deeply into the details of this, but this
> >> argument doesn't seem right to me.
> >>
> >> Normal RAM in Xen is what might be called "fungible" -- at boot, all
> >> RAM is zeroed, and it basically doesn't matter at all what RAM is
> >> given to what guest.  (There are restrictions of course: lowmem for
> >> DMA, contiguous superpages, &c; but within those groups, it doesn't
> >> matter *which* bit of lowmem you get, as long as you get enough to do
> >> your job.)  If you reboot your guest or hand RAM back to the
> >> hypervisor, you assume that everything in it will disappear.  When you
> >> ask for RAM, you can request some parameters that it will have
> >> (lowmem, on a specific node, &c), but you can't request a specific
> >> page that you had before.
> >>
> >> This is not the case for PMEM.  The whole point of PMEM (correct me if
> >> I'm wrong) is to be used for long-term storage that survives over
> >> reboot.  It matters very much that a guest be given the same PRAM
> >> after the host is rebooted that it was given before.  It doesn't make
> >> any sense to manage it the way Xen currently manages RAM (i.e., that
> >> you request a page and get whatever Xen happens to give you).
> > 
> > Interesting. This isn't the usage model I have been thinking about
> > so far. Having just gone back to the original 0/4 mail, I'm afraid
> > we're really left guessing, and you guessed differently than I did.
> > My understanding of the intentions of PMEM so far was that this
> > is a high-capacity, slower than DRAM but much faster than e.g.
> > swapping to disk alternative to normal RAM. I.e. the persistent
> > aspect of it wouldn't matter at all in this case (other than for PBLK,
> > obviously).
> 
> Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> then you're right -- it is just another form of RAM, that should be
> treated no differently than say, lowmem: a fungible resource that can be
> requested by setting a flag.

I would think of it as MMIO ranges rather than RAM. Yes, it is behind a
memory controller - but there are subtle things such as the new
instructions - pcommit, clflushopt, and others - that impact it.

Furthermore, ranges (contiguous and most likely discontiguous) of this
"RAM" have to be shared with guests (at least dom0) and with others
(multiple HVM guests).
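
To illustrate why those instructions matter to whoever ends up owning the
mapping, a minimal sketch of the store-then-flush sequence a guest would
issue on a word of mapped pmem (assumes a compiler providing the clwb
intrinsic, i.e. -mclwb; pcommit is left as a comment since intrinsic support
for it varies):

    /* Minimal persistence sketch: store, write back the cache line with
     * clwb, then fence.  On the hardware discussed here a pcommit would
     * follow to commit the data to the persistence domain; it is only a
     * comment below because toolchain support for its intrinsic varies. */
    #include <immintrin.h>
    #include <stdint.h>

    static void pmem_store_u64(volatile uint64_t *dst, uint64_t val)
    {
        *dst = val;                /* ordinary store into the mapped range   */
        _mm_clwb((void *)dst);     /* write back the line without evicting   */
        _mm_sfence();              /* order the write-back                   */
        /* pcommit would go here on CPUs/toolchains that support it */
    }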


> 
> Haozhong?
> 
>  -George
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 12:44                                       ` Jan Beulich
  2016-01-26 12:54                                         ` Juergen Gross
  2016-01-26 13:58                                         ` George Dunlap
@ 2016-01-26 15:30                                         ` Haozhong Zhang
  2016-01-26 15:33                                           ` Haozhong Zhang
  2016-01-26 15:57                                           ` Jan Beulich
  2 siblings, 2 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-26 15:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 01/26/16 05:44, Jan Beulich wrote:
> >>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> >>> On 01/21/16 03:25, Jan Beulich wrote:
> >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> >>>> >     VMs.
> >>>>
> >>>> Yes.
> >>>>
> >>>
> >>> But I Still do not quite understand this part: why must pmem resource
> >>> management and partition be done in hypervisor?
> >>
> >> Because that's where memory management belongs. And PMEM,
> >> other than PBLK, is just another form of RAM.
> > 
> > I haven't looked more deeply into the details of this, but this
> > argument doesn't seem right to me.
> > 
> > Normal RAM in Xen is what might be called "fungible" -- at boot, all
> > RAM is zeroed, and it basically doesn't matter at all what RAM is
> > given to what guest.  (There are restrictions of course: lowmem for
> > DMA, contiguous superpages, &c; but within those groups, it doesn't
> > matter *which* bit of lowmem you get, as long as you get enough to do
> > your job.)  If you reboot your guest or hand RAM back to the
> > hypervisor, you assume that everything in it will disappear.  When you
> > ask for RAM, you can request some parameters that it will have
> > (lowmem, on a specific node, &c), but you can't request a specific
> > page that you had before.
> > 
> > This is not the case for PMEM.  The whole point of PMEM (correct me if
> > I'm wrong) is to be used for long-term storage that survives over
> > reboot.  It matters very much that a guest be given the same PRAM
> > after the host is rebooted that it was given before.  It doesn't make
> > any sense to manage it the way Xen currently manages RAM (i.e., that
> > you request a page and get whatever Xen happens to give you).
> 
> Interesting. This isn't the usage model I have been thinking about
> so far. Having just gone back to the original 0/4 mail, I'm afraid
> we're really left guessing, and you guessed differently than I did.
> My understanding of the intentions of PMEM so far was that this
> is a high-capacity, slower than DRAM but much faster than e.g.
> swapping to disk alternative to normal RAM. I.e. the persistent
> aspect of it wouldn't matter at all in this case (other than for PBLK,
> obviously).
>

Of course, pmem could be used in the way you thought because of its
'RAM' aspect. But I think the more meaningful usage comes from its
persistent aspect. For example, the implementation of some journaling
file systems could store logs in pmem rather than in normal RAM, so
that if a power failure happens before those in-memory logs are
completely written to disk, there would still be a chance to restore
them from pmem after the next boot (rather than abandoning all of
them).

(I'm still writing the design doc, which will include more details of
the underlying hardware and the software interface of NVDIMM exposed by
current Linux)

> However, thinking through your usage model I have problems
> seeing it work in a reasonable way even with virtualization left
> aside: To my knowledge there's no established protocol on how
> multiple parties (different versions of the same OS, or even
> completely different OSes) would arbitrate using such memory
> ranges. And even for a single OS it is, other than for disks (and
> hence PBLK), not immediately clear how it would communicate
> from one boot to another what information got stored where,
> or how it would react to some or all of this storage having
> disappeared (just like a disk which got removed, which - unless
> it held the boot partition - would normally have pretty little
> effect on the OS coming back up).
>

The label storage area is a persistent area on an NVDIMM that can be used to
store partition information. It's not included in pmem (the part that is
mapped into the system address space). Instead, it can only be accessed
through the NVDIMM _DSM method [1]. However, what contents are stored and
how they are interpreted are left to software. One way is to follow the
NVDIMM Namespace Specification [2] and store an array of labels that
describe the start address (from base 0 of pmem) and the size of each
partition, which is called a namespace. On Linux, each namespace is exposed
as a /dev/pmemXX device.
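
To give an idea of what is kept there, a rough transcription of one label as
laid out in [2] (and mirrored by the Linux libnvdimm driver); please treat
the field names and sizes below as my paraphrase of the spec rather than an
authoritative definition:

    /* One entry of the label array stored in the label storage area,
     * roughly following the NVDIMM Namespace Specification [2]. */
    #include <stdint.h>

    struct namespace_label {
        uint8_t  uuid[16];     /* identifies the namespace                   */
        char     name[64];     /* optional friendly name                     */
        uint32_t flags;        /* e.g. read-only / updating                  */
        uint16_t nlabel;       /* number of labels forming this namespace    */
        uint16_t position;     /* this label's position within that set      */
        uint64_t isetcookie;   /* interleave set cookie (consistency check)  */
        uint64_t lbasize;      /* sector size for block namespaces, 0 = pmem */
        uint64_t dpa;          /* start DIMM physical address of this piece  */
        uint64_t rawsize;      /* size of this piece in bytes                */
        uint32_t slot;         /* slot index in the label storage area       */
        uint32_t unused;
    };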

In virtualization, the (virtual) label storage area of a vNVDIMM and
the corresponding _DSM method are emulated by QEMU. The virtual label
storage area is not written to the host one. Instead, we can reserve an
area on pmem for the virtual one.

Besides namespaces, we can also create DAX file systems on pmem and
use files to partition it.

Haozhong

> > So if Xen is going to use PMEM, it will have to invent an entirely new
> > interface for guests, and it will have to keep track of those
> > resources across host reboots.  In other words, it will have to
> > duplicate all the work that Linux already does.  What do we gain from
> > that duplication?  Why not just leverage what's already implemented in
> > dom0?
> 
> Indeed if my guessing on the intentions was wrong, then the
> picture completely changes (also for the points you've made
> further down).
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 15:30                                         ` Haozhong Zhang
@ 2016-01-26 15:33                                           ` Haozhong Zhang
  2016-01-26 15:57                                           ` Jan Beulich
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-26 15:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 01/26/16 23:30, Haozhong Zhang wrote:
> On 01/26/16 05:44, Jan Beulich wrote:
> > >>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> > > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> > >>> On 01/21/16 03:25, Jan Beulich wrote:
> > >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> > >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> > >>>> >     VMs.
> > >>>>
> > >>>> Yes.
> > >>>>
> > >>>
> > >>> But I Still do not quite understand this part: why must pmem resource
> > >>> management and partition be done in hypervisor?
> > >>
> > >> Because that's where memory management belongs. And PMEM,
> > >> other than PBLK, is just another form of RAM.
> > > 
> > > I haven't looked more deeply into the details of this, but this
> > > argument doesn't seem right to me.
> > > 
> > > Normal RAM in Xen is what might be called "fungible" -- at boot, all
> > > RAM is zeroed, and it basically doesn't matter at all what RAM is
> > > given to what guest.  (There are restrictions of course: lowmem for
> > > DMA, contiguous superpages, &c; but within those groups, it doesn't
> > > matter *which* bit of lowmem you get, as long as you get enough to do
> > > your job.)  If you reboot your guest or hand RAM back to the
> > > hypervisor, you assume that everything in it will disappear.  When you
> > > ask for RAM, you can request some parameters that it will have
> > > (lowmem, on a specific node, &c), but you can't request a specific
> > > page that you had before.
> > > 
> > > This is not the case for PMEM.  The whole point of PMEM (correct me if
> > > I'm wrong) is to be used for long-term storage that survives over
> > > reboot.  It matters very much that a guest be given the same PRAM
> > > after the host is rebooted that it was given before.  It doesn't make
> > > any sense to manage it the way Xen currently manages RAM (i.e., that
> > > you request a page and get whatever Xen happens to give you).
> > 
> > Interesting. This isn't the usage model I have been thinking about
> > so far. Having just gone back to the original 0/4 mail, I'm afraid
> > we're really left guessing, and you guessed differently than I did.
> > My understanding of the intentions of PMEM so far was that this
> > is a high-capacity, slower than DRAM but much faster than e.g.
> > swapping to disk alternative to normal RAM. I.e. the persistent
> > aspect of it wouldn't matter at all in this case (other than for PBLK,
> > obviously).
> >
> 
> Of course, pmem could be used in the way you thought because of its
> 'ram' aspect. But I think the more meaningful usage is from its
> persistent aspect. For example, the implementation of some journal
> file systems could store logs in pmem rather than the normal ram, so
> that if a power failure happens before those in-memory logs are
> completely written to the disk, there would still be chance to restore
> them from pmem after next booting (rather than abandoning all of
> them).
> 
> (I'm still writing the design doc which will include more details of
> underlying hardware and the software interface of nvdimm exposed by
> current linux)
> 
> > However, thinking through your usage model I have problems
> > seeing it work in a reasonable way even with virtualization left
> > aside: To my knowledge there's no established protocol on how
> > multiple parties (different versions of the same OS, or even
> > completely different OSes) would arbitrate using such memory
> > ranges. And even for a single OS it is, other than for disks (and
> > hence PBLK), not immediately clear how it would communicate
> > from one boot to another what information got stored where,
> > or how it would react to some or all of this storage having
> > disappeared (just like a disk which got removed, which - unless
> > it held the boot partition - would normally have pretty little
> > effect on the OS coming back up).
> >
> 
> Label storage area is a persistent area on NVDIMM and can be used to
> store partitions information. It's not included in pmem (that part
> that is mapped into the system address space). Instead, it can be only
> accessed through NVDIMM _DSM method [1]. However, what contents are
> stored and how they are interpreted are left to software. One way is
> to follow NVDIMM Namespace Specification [2] to store an array of
> labels that describe the start address (from the base 0 of pmem) and
> the size of each partition, which is called as namespace. On Linux,
> each namespace is exposed as a /dev/pmemXX device.
> 
> In the virtualization, the (virtual) label storage area of vNVDIMM and
> the corresponding _DSM method are emulated by QEMU. The virtual label
> storage area is not written to the host one. Instead, we can reserve a
> piece area on pmem for the virtual one.
> 
> Besides namespaces, we can also create DAX file systems on pmem and
> use files to partition.
>

Forgot references:
[1] NVDIMM DSM Interface Examples, http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[2] NVDIMM Namespace Specification, http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

> Haozhong
> 
> > > So if Xen is going to use PMEM, it will have to invent an entirely new
> > > interface for guests, and it will have to keep track of those
> > > resources across host reboots.  In other words, it will have to
> > > duplicate all the work that Linux already does.  What do we gain from
> > > that duplication?  Why not just leverage what's already implemented in
> > > dom0?
> > 
> > Indeed if my guessing on the intentions was wrong, then the
> > picture completely changes (also for the points you've made
> > further down).
> > 
> > Jan
> > 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 14:44                                           ` Konrad Rzeszutek Wilk
@ 2016-01-26 15:37                                             ` Jan Beulich
  2016-01-26 15:57                                               ` Haozhong Zhang
  0 siblings, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-26 15:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>>  Last year at Linux Plumbers Conference I attended a session dedicated
>> to NVDIMM support. I asked the very same question and the INTEL guy
>> there told me there is indeed something like a partition table meant
>> to describe the layout of the memory areas and their contents.
> 
> It is described in details at pmem.io, look at  Documents, see
> http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.

Well, that's about how PMEM and PBLK ranges get marked, but not
about how use of the space inside a PMEM range is coordinated.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 15:30                                         ` Haozhong Zhang
  2016-01-26 15:33                                           ` Haozhong Zhang
@ 2016-01-26 15:57                                           ` Jan Beulich
  2016-01-27  2:23                                             ` Haozhong Zhang
  1 sibling, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-26 15:57 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 26.01.16 at 16:30, <haozhong.zhang@intel.com> wrote:
> On 01/26/16 05:44, Jan Beulich wrote:
>> Interesting. This isn't the usage model I have been thinking about
>> so far. Having just gone back to the original 0/4 mail, I'm afraid
>> we're really left guessing, and you guessed differently than I did.
>> My understanding of the intentions of PMEM so far was that this
>> is a high-capacity, slower than DRAM but much faster than e.g.
>> swapping to disk alternative to normal RAM. I.e. the persistent
>> aspect of it wouldn't matter at all in this case (other than for PBLK,
>> obviously).
> 
> Of course, pmem could be used in the way you thought because of its
> 'ram' aspect. But I think the more meaningful usage is from its
> persistent aspect. For example, the implementation of some journal
> file systems could store logs in pmem rather than the normal ram, so
> that if a power failure happens before those in-memory logs are
> completely written to the disk, there would still be chance to restore
> them from pmem after next booting (rather than abandoning all of
> them).

Well, that leaves open how that file system would find its log
after reboot, or how that log is protected from clobbering by
another OS booted in between.

>> However, thinking through your usage model I have problems
>> seeing it work in a reasonable way even with virtualization left
>> aside: To my knowledge there's no established protocol on how
>> multiple parties (different versions of the same OS, or even
>> completely different OSes) would arbitrate using such memory
>> ranges. And even for a single OS it is, other than for disks (and
>> hence PBLK), not immediately clear how it would communicate
>> from one boot to another what information got stored where,
>> or how it would react to some or all of this storage having
>> disappeared (just like a disk which got removed, which - unless
>> it held the boot partition - would normally have pretty little
>> effect on the OS coming back up).
> 
> Label storage area is a persistent area on NVDIMM and can be used to
> store partitions information. It's not included in pmem (that part
> that is mapped into the system address space). Instead, it can be only
> accessed through NVDIMM _DSM method [1]. However, what contents are
> stored and how they are interpreted are left to software. One way is
> to follow NVDIMM Namespace Specification [2] to store an array of
> labels that describe the start address (from the base 0 of pmem) and
> the size of each partition, which is called as namespace. On Linux,
> each namespace is exposed as a /dev/pmemXX device.

According to what I've just read in one of the documents Konrad
pointed us to, there can be just one PMEM label per DIMM. Unless
I misread of course...

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 15:37                                             ` Jan Beulich
@ 2016-01-26 15:57                                               ` Haozhong Zhang
  2016-01-26 16:34                                                 ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-26 15:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell, George Dunlap,
	Andrew Cooper, Stefano Stabellini, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 01/26/16 08:37, Jan Beulich wrote:
> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> there told me there is indeed something like a partition table meant
> >> to describe the layout of the memory areas and their contents.
> > 
> > It is described in details at pmem.io, look at  Documents, see
> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
> 
> Well, that's about how PMEM and PBLK ranges get marked, but not
> about how use of the space inside a PMEM range is coordinated.
>

How an NVDIMM is partitioned into pmem and pblk is described by the ACPI NFIT
table. A namespace is to pmem what a partition table is to a disk.

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 15:57                                               ` Haozhong Zhang
@ 2016-01-26 16:34                                                 ` Jan Beulich
  2016-01-26 19:32                                                   ` Konrad Rzeszutek Wilk
  2016-01-27 10:55                                                   ` George Dunlap
  0 siblings, 2 replies; 88+ messages in thread
From: Jan Beulich @ 2016-01-26 16:34 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> On 01/26/16 08:37, Jan Beulich wrote:
>> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>> >>  Last year at Linux Plumbers Conference I attended a session dedicated
>> >> to NVDIMM support. I asked the very same question and the INTEL guy
>> >> there told me there is indeed something like a partition table meant
>> >> to describe the layout of the memory areas and their contents.
>> > 
>> > It is described in details at pmem.io, look at  Documents, see
>> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
>> 
>> Well, that's about how PMEM and PBLK ranges get marked, but not
>> about how use of the space inside a PMEM range is coordinated.
>>
> 
> How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT 
> table.
> Namespace to pmem is something like partition table to disk.

But I'm talking about sub-dividing the space inside an individual
PMEM range.

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 16:34                                                 ` Jan Beulich
@ 2016-01-26 19:32                                                   ` Konrad Rzeszutek Wilk
  2016-01-27  7:22                                                     ` Haozhong Zhang
  2016-01-27 10:16                                                     ` Jan Beulich
  2016-01-27 10:55                                                   ` George Dunlap
  1 sibling, 2 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-26 19:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> > On 01/26/16 08:37, Jan Beulich wrote:
> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> >> there told me there is indeed something like a partition table meant
> >> >> to describe the layout of the memory areas and their contents.
> >> > 
> >> > It is described in detail at pmem.io; look at Documents, see
> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, Namespaces section.
> >> 
> >> Well, that's about how PMEM and PBLK ranges get marked, but not
> >> about how use of the space inside a PMEM range is coordinated.
> >>
> > 
> > > How an NVDIMM is partitioned into pmem and pblk regions is described by
> > > the ACPI NFIT table.
> > > A namespace is to pmem what a partition table is to a disk.
> 
> But I'm talking about sub-dividing the space inside an individual
> PMEM range.

The namespaces are it.

Once you have done that you can mount the PMEM range under, say, /dev/pmem0
and then put a filesystem on it (ext4, xfs) - and enable DAX support.
The DAX just means that the FS will bypass the page cache and write directly
to the virtual address.

Then one can create giant 'dd' images on this filesystem and pass them
to QEMU to expose as an NVDIMM to the guest. Because it is a file, the blocks
(or MFNs) for the contents of the file are most certainly discontiguous.
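
A rough sketch of that flow (paths, sizes and image names below are made up
for illustration; the QEMU options are upstream QEMU's vNVDIMM ones, whose
exact spelling may differ from what the Xen device-model path discussed in
this series ends up using):

  mkfs.ext4 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0    # DAX: no page cache in the FS path
  dd if=/dev/zero of=/mnt/pmem0/guest.img bs=1M count=4096
  qemu-system-x86_64 -machine pc,nvdimm=on -m 4G,slots=2,maxmem=8G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/mnt/pmem0/guest.img,size=4G \
    -device nvdimm,id=nv1,memdev=mem1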

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 15:57                                           ` Jan Beulich
@ 2016-01-27  2:23                                             ` Haozhong Zhang
  0 siblings, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-27  2:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Wei Liu, Ian Campbell, Stefano Stabellini,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On 01/26/16 08:57, Jan Beulich wrote:
> >>> On 26.01.16 at 16:30, <haozhong.zhang@intel.com> wrote:
> > On 01/26/16 05:44, Jan Beulich wrote:
> >> Interesting. This isn't the usage model I have been thinking about
> >> so far. Having just gone back to the original 0/4 mail, I'm afraid
> >> we're really left guessing, and you guessed differently than I did.
> >> My understanding of the intention of PMEM so far was that it is a
> >> high-capacity alternative to normal RAM - slower than DRAM but much
> >> faster than e.g. swapping to disk. I.e. the persistent aspect of it
> >> wouldn't matter at all in this case (other than for PBLK, obviously).
> > 
> > Of course, pmem could be used in the way you thought because of its
> > 'ram' aspect. But I think the more meaningful usage is from its
> > persistent aspect. For example, some journaling file systems could
> > store their logs in pmem rather than in normal RAM, so that if a power
> > failure happens before those in-memory logs are completely written to
> > the disk, there would still be a chance to restore them from pmem after
> > the next boot (rather than abandoning all of them).
> 
> Well, that leaves open how that file system would find its log
> after reboot, or how that log is protected from clobbering by
> another OS booted in between.
>

It would depend on the concrete design of the OSes or applications
involved. This is just an example to show a possible usage of the
persistence aspect.

> >> However, thinking through your usage model I have problems
> >> seeing it work in a reasonable way even with virtualization left
> >> aside: To my knowledge there's no established protocol on how
> >> multiple parties (different versions of the same OS, or even
> >> completely different OSes) would arbitrate using such memory
> >> ranges. And even for a single OS it is, other than for disks (and
> >> hence PBLK), not immediately clear how it would communicate
> >> from one boot to another what information got stored where,
> >> or how it would react to some or all of this storage having
> >> disappeared (just like a disk which got removed, which - unless
> >> it held the boot partition - would normally have pretty little
> >> effect on the OS coming back up).
> > 
> > The label storage area is a persistent area on an NVDIMM and can be used
> > to store partition information. It's not included in pmem (the part that
> > is mapped into the system address space). Instead, it can only be
> > accessed through the NVDIMM _DSM method [1]. However, what contents are
> > stored and how they are interpreted are left to software. One way is to
> > follow the NVDIMM Namespace Specification [2] and store an array of
> > labels that describe the start address (from base 0 of the pmem range)
> > and the size of each partition, which is called a namespace. On Linux,
> > each namespace is exposed as a /dev/pmemXX device.
> 
> According to what I've just read in one of the documents Konrad
> pointed us to, there can be just one PMEM label per DIMM. Unless
> I misread of course...
>

My mistake, only one pmem label per DIMM.

Haozhong

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 19:32                                                   ` Konrad Rzeszutek Wilk
@ 2016-01-27  7:22                                                     ` Haozhong Zhang
  2016-01-27 10:16                                                     ` Jan Beulich
  1 sibling, 0 replies; 88+ messages in thread
From: Haozhong Zhang @ 2016-01-27  7:22 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Jun Nakajima, Xiao Guangrong,
	Keir Fraser

On 01/26/16 14:32, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> > >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> > > On 01/26/16 08:37, Jan Beulich wrote:
> > >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> > >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> > >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> > >> >> there told me there is indeed something like a partition table meant
> > >> >> to describe the layout of the memory areas and their contents.
> > >> > 
> > >> > It is described in detail at pmem.io; look at Documents, see
> > >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, Namespaces section.
> > >> 
> > >> Well, that's about how PMEM and PBLK ranges get marked, but not
> > >> about how use of the space inside a PMEM range is coordinated.
> > >>
> > > 
> > > How an NVDIMM is partitioned into pmem and pblk regions is described by
> > > the ACPI NFIT table.
> > > A namespace is to pmem what a partition table is to a disk.
> > 
> > But I'm talking about sub-dividing the space inside an individual
> > PMEM range.
> 
> The namespaces are it.
>

Because only one persistent memory namespace is allowed for an
individual pmem range, namespaces cannot be used to sub-divide it.

> Once you have done them you can mount the PMEM range under say /dev/pmem0
> and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> The DAX just means that the FS will bypass the page cache and write directly
> to the virtual address.
> 
> then one can create giant 'dd' images on this filesystem and pass it
> to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the blocks
> (or MFNs) for the contents of the file are most certainly discontiguous.
>

Though the 'dd' image may occupy discontiguous MFNs on host pmem, we can map them
to contiguous guest PFNs.
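
For illustration (re-using the hypothetical paths from the sketch earlier in
the thread):

  # on the host: the backing file's extents on /dev/pmem0 may well be scattered
  filefrag -v /mnt/pmem0/guest.img
  # inside the guest: the vNVDIMM is still presented as a single contiguous
  # persistent-memory range
  grep -i persistent /proc/iomem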

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 19:32                                                   ` Konrad Rzeszutek Wilk
  2016-01-27  7:22                                                     ` Haozhong Zhang
@ 2016-01-27 10:16                                                     ` Jan Beulich
  2016-01-27 14:50                                                       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2016-01-27 10:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

>>> On 26.01.16 at 20:32, <konrad.wilk@oracle.com> wrote:
> On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
>> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
>> > On 01/26/16 08:37, Jan Beulich wrote:
>> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>> >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
>> >> >> to NVDIMM support. I asked the very same question and the INTEL guy
>> >> >> there told me there is indeed something like a partition table meant
>> >> >> to describe the layout of the memory areas and their contents.
>> >> > 
>> >> > It is described in detail at pmem.io; look at Documents, see
>> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, Namespaces section.
>> >> 
>> >> Well, that's about how PMEM and PBLK ranges get marked, but not
>> >> about how use of the space inside a PMEM range is coordinated.
>> >>
>> > 
>> > How an NVDIMM is partitioned into pmem and pblk regions is described by
>> > the ACPI NFIT table.
>> > A namespace is to pmem what a partition table is to a disk.
>> 
>> But I'm talking about sub-dividing the space inside an individual
>> PMEM range.
> 
> The namespaces are it.
> 
> Once you have done them you can mount the PMEM range under say /dev/pmem0
> and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> The DAX just means that the FS will bypass the page cache and write directly
> to the virtual address.
> 
> then one can create giant 'dd' images on this filesystem and pass it
> to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the blocks
> (or MFNs) for the contents of the file are most certainly discontiguous.

And what's the advantage of this over PBLK? I.e. why would one
want to separate PMEM and PBLK ranges if everything gets used
the same way anyway?

Jan

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-26 16:34                                                 ` Jan Beulich
  2016-01-26 19:32                                                   ` Konrad Rzeszutek Wilk
@ 2016-01-27 10:55                                                   ` George Dunlap
  1 sibling, 0 replies; 88+ messages in thread
From: George Dunlap @ 2016-01-27 10:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong, Keir Fraser

On Tue, Jan 26, 2016 at 4:34 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
>> On 01/26/16 08:37, Jan Beulich wrote:
>>> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>>> >>  Last year at Linux Plumbers Conference I attended a session dedicated
>>> >> to NVDIMM support. I asked the very same question and the INTEL guy
>>> >> there told me there is indeed something like a partition table meant
>>> >> to describe the layout of the memory areas and their contents.
>>> >
>>> > It is described in detail at pmem.io; look at Documents, see
>>> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, Namespaces section.
>>>
>>> Well, that's about how PMEM and PBLK ranges get marked, but not
>>> about how use of the space inside a PMEM range is coordinated.
>>>
>>
>> How an NVDIMM is partitioned into pmem and pblk regions is described by
>> the ACPI NFIT table.
>> A namespace is to pmem what a partition table is to a disk.
>
> But I'm talking about sub-dividing the space inside an individual
> PMEM range.

Well, as long as at a high level full PMEM blocks can be allocated /
assigned to a single OS, that OS can figure out whether / how to further
subdivide them (and store information about that subdivision).

But in any case, since it seems from what Haozhong and Konrad say
that the point of this *is* in fact to take advantage of the
persistence, allowing Linux to solve the problem of how to subdivide
PMEM blocks and just leveraging its solution would be better than
trying to duplicate all that effort inside of Xen.

 -George

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu
  2016-01-27 10:16                                                     ` Jan Beulich
@ 2016-01-27 14:50                                                       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 88+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-27 14:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Haozhong Zhang, Kevin Tian, Wei Liu, Ian Campbell,
	Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jun Nakajima, Xiao Guangrong, Keir Fraser

On Wed, Jan 27, 2016 at 03:16:59AM -0700, Jan Beulich wrote:
> >>> On 26.01.16 at 20:32, <konrad.wilk@oracle.com> wrote:
> > On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> >> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> >> > On 01/26/16 08:37, Jan Beulich wrote:
> >> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >> >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> >> >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> >> >> there told me there is indeed something like a partition table meant
> >> >> >> to describe the layout of the memory areas and their contents.
> >> >> > 
> >> >> > It is described in detail at pmem.io; look at Documents, see
> >> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, Namespaces section.
> >> >> 
> >> >> Well, that's about how PMEM and PBLK ranges get marked, but not
> >> >> about how use of the space inside a PMEM range is coordinated.
> >> >>
> >> > 
> >> > How an NVDIMM is partitioned into pmem and pblk regions is described by
> >> > the ACPI NFIT table.
> >> > A namespace is to pmem what a partition table is to a disk.
> >> 
> >> But I'm talking about sub-dividing the space inside an individual
> >> PMEM range.
> > 
> > The namespaces are it.
> > 
> > Once you have done them you can mount the PMEM range under say /dev/pmem0
> > and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> > The DAX just means that the FS will bypass the page cache and write directly
> > to the virtual address.
> > 
> > then one can create giant 'dd' images on this filesystem and pass it
> > to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the blocks
> > (or MFNs) for the contents of the file are most certainly discontiguous.
> 
> And what's the advantage of this over PBLK? I.e. why would one
> want to separate PMEM and PBLK ranges if everything gets used
> the same way anyway?

Speed. PBLK emulates hardware - by providing a sliding window into the DIMM.
The OS can only write to a ring buffer with the system address and the payload
(64 bytes I think?) - and the hardware (or firmware) picks it up and does the
writes to the NVDIMM.

The only motivation behind this is to deal with errors. Normal PMEM writes
do not report errors: if the media is busted, the hardware will engage its
remap logic and write somewhere else - until all of its remap blocks have
been exhausted. At that point writes (I presume, not sure) and reads will
report an error - but via an #MCE.

Part of this Xen design will be how to handle that :-)

With PBLK - I presume the hardware/firmware will read the block back after it
has written it - and if there are errors it will report them right away. Which
means you can easily hook PBLK nicely into RAID setups right away. It will be
slower than PMEM, but it does give you the normal error reporting - that is,
until the #MCE -> OS -> fs error-reporting logic gets figured out.

The #MCE logic is being developed right now by Tony Luck on LKML - and the
last I saw, the #MCE carries the system address, and the MCE code would tag
the affected pages with some bit so that applications would get a signal.

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2016-01-27 14:50 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-29 11:31 [PATCH 0/4] add support for vNVDIMM Haozhong Zhang
2015-12-29 11:31 ` [PATCH 1/4] x86/hvm: allow guest to use clflushopt and clwb Haozhong Zhang
2015-12-29 15:46   ` Andrew Cooper
2015-12-30  1:35     ` Haozhong Zhang
2015-12-30  2:16       ` Haozhong Zhang
2015-12-30 10:33         ` Andrew Cooper
2015-12-29 11:31 ` [PATCH 2/4] x86/hvm: add support for pcommit instruction Haozhong Zhang
2015-12-29 11:31 ` [PATCH 3/4] tools/xl: add a new xl configuration 'nvdimm' Haozhong Zhang
2016-01-04 11:16   ` Wei Liu
2016-01-06 12:40   ` Jan Beulich
2016-01-06 15:28     ` Haozhong Zhang
2015-12-29 11:31 ` [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu Haozhong Zhang
2016-01-15 17:10   ` Jan Beulich
2016-01-18  0:52     ` Haozhong Zhang
2016-01-18  8:46       ` Jan Beulich
2016-01-19 11:37         ` Wei Liu
2016-01-19 11:46           ` Jan Beulich
2016-01-20  5:14             ` Tian, Kevin
2016-01-20  5:58               ` Zhang, Haozhong
2016-01-20  5:31         ` Haozhong Zhang
2016-01-20  8:46           ` Jan Beulich
2016-01-20  8:58             ` Andrew Cooper
2016-01-20 10:15               ` Haozhong Zhang
2016-01-20 10:36                 ` Xiao Guangrong
2016-01-20 13:16                   ` Andrew Cooper
2016-01-20 14:29                     ` Stefano Stabellini
2016-01-20 14:42                       ` Haozhong Zhang
2016-01-20 14:45                       ` Andrew Cooper
2016-01-20 14:53                         ` Haozhong Zhang
2016-01-20 15:13                           ` Konrad Rzeszutek Wilk
2016-01-20 15:29                             ` Haozhong Zhang
2016-01-20 15:41                               ` Konrad Rzeszutek Wilk
2016-01-20 15:54                                 ` Haozhong Zhang
2016-01-21  3:35                                 ` Bob Liu
2016-01-20 15:05                         ` Stefano Stabellini
2016-01-20 18:14                           ` Andrew Cooper
2016-01-20 14:38                     ` Haozhong Zhang
2016-01-20 11:04             ` Haozhong Zhang
2016-01-20 11:20               ` Jan Beulich
2016-01-20 15:29                 ` Xiao Guangrong
2016-01-20 15:47                   ` Konrad Rzeszutek Wilk
2016-01-20 16:25                     ` Xiao Guangrong
2016-01-20 16:47                       ` Konrad Rzeszutek Wilk
2016-01-20 16:55                         ` Xiao Guangrong
2016-01-20 17:18                           ` Konrad Rzeszutek Wilk
2016-01-20 17:23                             ` Xiao Guangrong
2016-01-20 17:48                               ` Konrad Rzeszutek Wilk
2016-01-21  3:12                             ` Haozhong Zhang
2016-01-20 17:07                   ` Jan Beulich
2016-01-20 17:17                     ` Xiao Guangrong
2016-01-21  8:18                       ` Jan Beulich
2016-01-21  8:25                         ` Xiao Guangrong
2016-01-21  8:53                           ` Jan Beulich
2016-01-21  9:10                             ` Xiao Guangrong
2016-01-21  9:29                               ` Andrew Cooper
2016-01-21 10:26                                 ` Jan Beulich
2016-01-21 10:25                               ` Jan Beulich
2016-01-21 14:01                                 ` Haozhong Zhang
2016-01-21 14:52                                   ` Jan Beulich
2016-01-22  2:43                                     ` Haozhong Zhang
2016-01-26 11:44                                     ` George Dunlap
2016-01-26 12:44                                       ` Jan Beulich
2016-01-26 12:54                                         ` Juergen Gross
2016-01-26 14:44                                           ` Konrad Rzeszutek Wilk
2016-01-26 15:37                                             ` Jan Beulich
2016-01-26 15:57                                               ` Haozhong Zhang
2016-01-26 16:34                                                 ` Jan Beulich
2016-01-26 19:32                                                   ` Konrad Rzeszutek Wilk
2016-01-27  7:22                                                     ` Haozhong Zhang
2016-01-27 10:16                                                     ` Jan Beulich
2016-01-27 14:50                                                       ` Konrad Rzeszutek Wilk
2016-01-27 10:55                                                   ` George Dunlap
2016-01-26 13:58                                         ` George Dunlap
2016-01-26 14:46                                           ` Konrad Rzeszutek Wilk
2016-01-26 15:30                                         ` Haozhong Zhang
2016-01-26 15:33                                           ` Haozhong Zhang
2016-01-26 15:57                                           ` Jan Beulich
2016-01-27  2:23                                             ` Haozhong Zhang
2016-01-20 15:07               ` Konrad Rzeszutek Wilk
2016-01-06 15:37 ` [PATCH 0/4] add support for vNVDIMM Ian Campbell
2016-01-06 15:47   ` Haozhong Zhang
2016-01-20  3:28 ` Tian, Kevin
2016-01-20 12:43   ` Stefano Stabellini
2016-01-20 14:26     ` Zhang, Haozhong
2016-01-20 14:35       ` Stefano Stabellini
2016-01-20 14:47         ` Zhang, Haozhong
2016-01-20 14:54           ` Andrew Cooper
2016-01-20 15:59             ` Haozhong Zhang
