* [PATCH RFC 00/10] device-dax: Support devices without PFN metadata
@ 2020-01-10 19:03 Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Hey,

Presented herewith is a small series which allows device-dax to back KVM guest
memory without struct page. It's an RFC, and there are still some items we're
looking at (see TODO below); but we're wondering if folks would be OK carving
some time out of their busy schedules to provide feedback direction-wise on
this work.

In virtualized environments (especially those with no kernel-backed PV
interfaces, and just SR-IOV), memory is largely assigned to guests: either
persistent with NVDIMMs or volatile for regular RAM. The kernel
(hypervisor) tracks it with a 'struct page' (64 bytes) for each 4K page.
Overall we're spending 16GB for each 1TB of host memory tracked that the
kernel won't need, and which could instead be used to create other guests.
One of the motivations of this series is to get back the memory spent on
'struct page' when it is meant to be used solely by userspace. This is also
useful for the case of memory backing guests' virtual NVDIMMs. The other
neat side effect is that the hypervisor has no virtual mapping of the guest,
and hence code gadgets (if found) are limited in their effectiveness.
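
For reference, the arithmetic behind the 16GB figure:

  (1 TiB / 4 KiB per page) * 64 bytes per struct page
    = 268,435,456 pages * 64 bytes = 16 GiB  (~1.6% of the memory tracked)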

It is expected that a smaller (instead of the total) amount of host memory is
defined for the kernel (with mem=X or memmap=X!Y). For a KVM userspace VMM
(e.g. QEMU), the main thing that is needed is a device which supports open +
mmap + close with a certain alignment (4K, 2M, 1G). That made us look at
device-dax, which does just that, and so the work comprised here was improving
what's there and the interfaces it uses.
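
For illustration, a minimal sketch of that open + mmap + close pattern from a
VMM's point of view (the device path and size below are just examples, and
error handling is reduced to the basics):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 32UL << 30;	/* must respect the region alignment */
	int fd = open("/dev/dax0.0", O_RDWR);
	void *mem;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* device-dax validates the vma and alignment at mmap/fault time */
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... hand 'mem' to KVM via KVM_SET_USER_MEMORY_REGION ... */

	munmap(mem, len);
	close(fd);
	return 0;
}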

The series is divided as follows:

 * Patch 1, 3: Preparatory work for patch 7, adding support for
	       vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL

 * Patch 2, 4: Preparatory work for patch 7, adding support for
	       follow_pfn() to work with 2M/1G huge pages, which is
	       what KVM uses for VM_PFNMAP.

 * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
	       which encompasses mainly dealing with the lack of devmap,
	       and creating a VM_PFNMAP vma.

 * Patch 8: PMEM support for no PFN metadata, for device-dax namespaces only.
	   At the very end of the cover letter (after the scissors mark),
	   there's a patch for ndctl to be able to create namespaces
	   with '--mode devdax --map none'.

 * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
 	    a PFN.

 * Patch 10: The actual end-consumer example for the RAM case. The patch just
	     adds a label storage area, which consequently allows namespaces
	     to be created. We picked legacy PMEM for starters.

Thoughts and comments appreciated.

	Joao

P.S. As an example to try this out:

 1) Add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
 so that the kernel keeps 16G for itself.

 2) create a devdax namespace with 1G hugepages: 

 $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
 {
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"none",
  "size":"32.00 GiB (34.36 GB)",
  "uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
  "daxregion":{
    "id":0,
    "size":"32.00 GiB (34.36 GB)",
    "align":1073741824,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"32.00 GiB (34.36 GB)",
        "target_node":0,
        "mode":"devdax"
      }
    ]
  },
  "align":1073741824
 }

 3) Add this to your qemu params:
  -m 32G 
  -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
  -numa node,memdev=mem
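
 4) Optionally, sanity-check that the guest RAM is a page-less VM_PFNMAP
 mapping by looking at QEMU's smaps ('pf' in the VmFlags line is the
 /proc/<pid>/smaps abbreviation for VM_PFNMAP; the command below is just an
 illustration, adjust the binary name as needed):

 $ grep -A20 dax0.0 /proc/$(pidof qemu-system-x86_64)/smaps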

TODO:

 * Discontiguous regions/namespaces: The work above is limited to the max
contiguous extent, coming from nvdimm dpa allocation heuristics -- which I
take is because of what the specs allow for persistent namespaces. But for
the volatile RAM case we would need handling of discontiguous extents (hence
a region would represent more than one resource) to be less bound to how
guests are placed on the system. I played around with multi-resource support
for device-dax, but I'm wondering about UABI: 1) whether nvdimm DPA
allocation heuristics should be relaxed for the RAM case (under certain
nvdimm region bits); or 2) whether device-dax should have its own separate
UABI to be used by daxctl (which would also be useful for hmem devices?).

 * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
device-dax, which would allow collect_procs() to find the processes solely
based on the PFN. But for discontiguous namespaces I'm not sure this would
work; perhaps we could look at the dax-region pfn range for each DAX vma.

 * NUMA: For now we excluded setting the target_node, while these two patches
 are being worked on [1][2].

 [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
 [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/


Joao Martins (9):
  mm: Add pmd support for _PAGE_SPECIAL
  mm: Handle pmd entries in follow_pfn()
  mm: Add pud support for _PAGE_SPECIAL
  mm: Handle pud entries in follow_pfn()
  device-dax: Do not enforce MADV_DONTFORK on mmap()
  device-dax: Introduce pfn_flags helper
  device-dax: Add support for PFN_SPECIAL flags
  dax/pmem: Add device-dax support for PFN_MODE_NONE
  nvdimm/e820: add multiple namespaces support

Nikita Leshenko (1):
  vfio/type1: Use follow_pfn for VM_PFNMAP VMAs

 arch/x86/include/asm/pgtable.h  |  34 ++++-
 drivers/dax/bus.c               |   3 +-
 drivers/dax/device.c            |  78 ++++++++----
 drivers/dax/pmem/core.c         |  36 +++++-
 drivers/nvdimm/e820.c           | 212 ++++++++++++++++++++++++++++----
 drivers/vfio/vfio_iommu_type1.c |   6 +-
 mm/gup.c                        |   6 +
 mm/huge_memory.c                |  15 ++-
 mm/memory.c                     |  67 ++++++++--
 9 files changed, 382 insertions(+), 75 deletions(-)

8<----------------
Subject: [PATCH] ndctl: add 'devdax' support for NDCTL_PFN_LOC_NONE

diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 7fb00078646b..2568943eb207 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -206,6 +206,8 @@ static int set_defaults(enum device_action mode)
 			/* pass */;
 		else if (strcmp(param.map, "dev") == 0)
 			/* pass */;
+		else if (strcmp(param.map, "none") == 0)
+			/* pass */;
 		else {
 			error("invalid map location '%s'\n", param.map);
 			rc = -EINVAL;
@@ -755,9 +757,17 @@ static int validate_namespace_options(struct ndctl_region *region,
 	if (param.map) {
 		if (!strcmp(param.map, "mem"))
 			p->loc = NDCTL_PFN_LOC_RAM;
+		else if (!strcmp(param.map, "none"))
+			p->loc = NDCTL_PFN_LOC_NONE;
 		else
 			p->loc = NDCTL_PFN_LOC_PMEM;
 
+		if (p->loc == NDCTL_PFN_LOC_NONE
+			&& p->mode != NDCTL_NS_MODE_DAX) {
+			debug("--map=none only valid for devdax mode namespace\n");
+			return -EINVAL;
+		}
+
 		if (ndns && p->mode != NDCTL_NS_MODE_MEMORY
 			&& p->mode != NDCTL_NS_MODE_DAX) {
 			debug("%s: --map= only valid for fsdax mode namespace\n",


-- 
2.17.1



* [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-02-03 21:34   ` Matthew Wilcox
  2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Currently vmf_insert_pfn_pmd() only works with devmap
and BUG()s otherwise. Add support for handling the
special bit when pfn_t has it marked with PFN_SPECIAL.

Usage of page special isn't expected to do GUP,
hence return no pages on gup_huge_pmd(), much like how
it is done for ptes on gup_pte_range().

This allows a DAX driver to handle 2M hugepages without
struct pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 arch/x86/include/asm/pgtable.h | 16 +++++++++++++++-
 mm/gup.c                       |  3 +++
 mm/huge_memory.c               |  7 ++++---
 mm/memory.c                    |  3 ++-
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ad97dc155195..60351c0c15fe 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -255,7 +255,7 @@ static inline int pmd_large(pmd_t pte)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_trans_huge(pmd_t pmd)
 {
-	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
 {
 	return 0;
 }
+#endif
+
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline int pmd_special(pmd_t pmd)
+{
+	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
+}
+#endif
+
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -414,6 +423,11 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DEVMAP);
 }
 
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..ba5f10535392 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
 	}
 
+	if (pmd_special(orig))
+		return 0;
+
 	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..06ad4d6f7477 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pmd_mkdevmap(entry);
+	else if (pfn_t_special(pfn))
+		entry = pmd_mkspecial(entry);
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -2013,7 +2014,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+			pmd_devmap(*pmd) || pmd_special(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9a4f52..cfc3668bddeb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1165,7 +1165,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+		    pmd_devmap(*pmd) || pmd_special(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
-- 
2.17.1



* [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn()
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-02-03 21:37   ` Matthew Wilcox
  2020-01-10 19:03 ` [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL Joao Martins
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

When follow_pfn() hits a pmd_huge() entry it won't return a valid
PFN, given its usage of follow_pte(). Fix that up by passing a @pmdpp
and thus allowing callers to get the pmd pointer. If we encounter
such a huge page, we calculate the pfn offset within the PMD
accordingly.

This allows KVM to handle 2M hugepage pfns on VM_PFNMAP vmas.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index cfc3668bddeb..db99684d2cb3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4366,6 +4366,7 @@ EXPORT_SYMBOL(follow_pte_pmd);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn)
 {
+	pmd_t *pmdpp = NULL;
 	int ret = -EINVAL;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -4373,10 +4374,14 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
+	ret = follow_pte_pmd(vma->vm_mm, address, NULL,
+			     &ptep, &pmdpp, &ptl);
 	if (ret)
 		return ret;
-	*pfn = pte_pfn(*ptep);
+	if (pmdpp)
+		*pfn = pmd_pfn(*pmdpp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+	else
+		*pfn = pte_pfn(*ptep);
 	pte_unmap_unlock(ptep, ptl);
 	return 0;
 }
-- 
2.17.1



* [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn() Joao Martins
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Currently vmf_insert_pfn_pud() only works with devmap
and BUG()s otherwise. Add support for handling the
special bit when pfn_t has it marked with PFN_SPECIAL.

Usage of this type of page isn't expected to do GUP,
hence return no pages on gup_huge_pud(), much like how
it is done for ptes on gup_pte_range() and for pmds on
gup_huge_pmd().

This allows device-dax to handle 1G hugepages without
struct pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 arch/x86/include/asm/pgtable.h | 18 +++++++++++++++++-
 mm/gup.c                       |  3 +++
 mm/huge_memory.c               |  8 +++++---
 mm/memory.c                    |  3 ++-
 4 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 60351c0c15fe..2027c063fa16 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -261,7 +261,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
 }
 #endif
 
@@ -300,6 +300,17 @@ static inline int pmd_special(pmd_t pmd)
 {
 	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
 }
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_special(pud_t pud)
+{
+	return !!(pud_flags(pud) & _PAGE_SPECIAL);
+}
+#else
+static inline int pud_special(pud_t pud)
+{
+	return 0;
+}
 #endif
 
 #endif
@@ -487,6 +498,11 @@ static inline pud_t pud_mkhuge(pud_t pud)
 	return pud_set_flags(pud, _PAGE_PSE);
 }
 
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+	return pud_set_flags(pud, _PAGE_SPECIAL);
+}
+
 static inline pud_t pud_mkyoung(pud_t pud)
 {
 	return pud_set_flags(pud, _PAGE_ACCESSED);
diff --git a/mm/gup.c b/mm/gup.c
index ba5f10535392..ae4abe5878ad 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2123,6 +2123,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
 	}
 
+	if (pud_special(orig))
+		return 0;
+
 	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 06ad4d6f7477..cff707163bc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -879,6 +879,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pud_mkdevmap(entry);
+	else if (pfn_t_special(pfn))
+		entry = pud_mkspecial(entry);
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
@@ -901,8 +903,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -2031,7 +2032,8 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 
 	ptl = pud_lock(vma->vm_mm, pud);
-	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud) ||
+		   pud_special(*pud)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index db99684d2cb3..109643219e1b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1201,7 +1201,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+		if (pud_trans_huge(*pud) || pud_devmap(*pud) ||
+		    pud_special(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
 				split_huge_pud(vma, pud, addr);
-- 
2.17.1



* [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn()
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (2 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap() Joao Martins
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

When follow_pfn() hits a pud_huge() entry it won't return a valid PFN
for a PUD. Fix it by adding a @pudpp and thus allowing callers to
get the pud pointer. If we encounter such a huge page, we
calculate the pfn offset within the PUD accordingly.

This allows KVM to handle 1G hugepage pfns on VM_PFNMAP vmas.

Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 109643219e1b..f46646630497 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4261,9 +4261,10 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+static int __follow_pte_pud(struct mm_struct *mm, unsigned long address,
 			    struct mmu_notifier_range *range,
-			    pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+			    pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp,
+			    spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4280,6 +4281,28 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		goto out;
 
 	pud = pud_offset(p4d, address);
+	VM_BUG_ON(pud_trans_huge(*pud));
+
+	if (pud_huge(*pud)) {
+		if (!pudpp)
+			goto out;
+
+		if (range) {
+			mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0,
+						NULL, mm, address & PUD_MASK,
+						(address & PUD_MASK) + PUD_SIZE);
+			mmu_notifier_invalidate_range_start(range);
+		}
+		*ptlp = pud_lock(mm, pud);
+		if (pud_huge(*pud)) {
+			*pudpp = pud;
+			return 0;
+		}
+		spin_unlock(*ptlp);
+		if (range)
+			mmu_notifier_invalidate_range_end(range);
+	}
+
 	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
 		goto out;
 
@@ -4335,8 +4358,8 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, NULL,
-						    ptepp, NULL, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, NULL,
+						    ptepp, NULL, NULL, ptlp)));
 	return res;
 }
 
@@ -4348,12 +4371,26 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, range,
-						    ptepp, pmdpp, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, NULL, ptlp)));
 	return res;
 }
 EXPORT_SYMBOL(follow_pte_pmd);
 
+static int follow_pte_pud(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range,
+			  pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp,
+			  spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, pudpp, ptlp)));
+	return res;
+}
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4368,6 +4405,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn)
 {
 	pmd_t *pmdpp = NULL;
+	pud_t *pudpp = NULL;
 	int ret = -EINVAL;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -4375,11 +4413,13 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte_pmd(vma->vm_mm, address, NULL,
-			     &ptep, &pmdpp, &ptl);
+	ret = follow_pte_pud(vma->vm_mm, address, NULL,
+			     &ptep, &pmdpp, &pudpp, &ptl);
 	if (ret)
 		return ret;
-	if (pmdpp)
+	if (pudpp)
+		*pfn = pud_pfn(*pudpp) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+	else if (pmdpp)
 		*pfn = pmd_pfn(*pmdpp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 	else
 		*pfn = pte_pfn(*ptep);
-- 
2.17.1



* [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap()
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (3 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn() Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper Joao Martins
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Currently check_vma() checks for VM_DONTCOPY on a device-dax vma
when the dax region is not backed by devmap (i.e. PFN_MAP is not set).
But VM_DONTCOPY is set through madvise(MADV_DONTFORK), which can only
be applied to an address returned from mmap(). Since check_vma() is
called at devdax mmap time, checking VM_DONTCOPY there prevents a
process from mmap-ing the device at all.

Let's not enforce MADV_DONTFORK at mmap(), but rather when the mapping
actually gets used (on fault). The assumptions don't change, as it is
still expected to retain/madvise MADV_DONTFORK after mmap.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 1af823b2fe6b..c6a7f5e12c54 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -14,7 +14,7 @@
 #include "dax-private.h"
 #include "bus.h"
 
-static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+static int check_vma_mmap(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		const char *func)
 {
 	struct dax_region *dax_region = dev_dax->region;
@@ -41,17 +41,29 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		return -EINVAL;
 	}
 
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
-			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+	if (!vma_is_dax(vma)) {
 		dev_info_ratelimited(dev,
-				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
+				"%s: %s: fail, vma is not DAX capable\n",
 				current->comm, func);
 		return -EINVAL;
 	}
 
-	if (!vma_is_dax(vma)) {
-		dev_info_ratelimited(dev,
-				"%s: %s: fail, vma is not DAX capable\n",
+	return 0;
+}
+
+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+		     const char *func)
+{
+	int rc;
+
+	rc = check_vma_mmap(dev_dax, vma, func);
+	if (rc < 0)
+		return rc;
+
+	if ((dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
+			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+		dev_info_ratelimited(&dev_dax->dev,
+				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
 				current->comm, func);
 		return -EINVAL;
 	}
@@ -315,7 +327,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * fault time.
 	 */
 	id = dax_read_lock();
-	rc = check_vma(dev_dax, vma, __func__);
+	rc = check_vma_mmap(dev_dax, vma, __func__);
 	dax_read_unlock(id);
 	if (rc)
 		return rc;
-- 
2.17.1



* [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (4 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap() Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags Joao Martins
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Replace the PFN_DEV|PFN_MAP checks at call sites with two helper
functions, dax_is_pfn_dev() and dax_is_pfn_map().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index c6a7f5e12c54..113a554de3ee 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -14,6 +14,17 @@
 #include "dax-private.h"
 #include "bus.h"
 
+static int dax_is_pfn_dev(struct dev_dax *dev_dax)
+{
+	return (dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV;
+}
+
+static int dax_is_pfn_map(struct dev_dax *dev_dax)
+{
+	return (dev_dax->region->pfn_flags &
+		(PFN_DEV|PFN_MAP)) == (PFN_DEV|PFN_MAP);
+}
+
 static int check_vma_mmap(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		const char *func)
 {
@@ -60,8 +71,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	if (rc < 0)
 		return rc;
 
-	if ((dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
-			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+	if (dax_is_pfn_dev(dev_dax) && (vma->vm_flags & VM_DONTCOPY) == 0) {
 		dev_info_ratelimited(&dev_dax->dev,
 				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
 				current->comm, func);
@@ -140,7 +150,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	}
 
 	/* dax pmd mappings require pfn_t_devmap() */
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
+	if (!dax_is_pfn_map(dev_dax)) {
 		dev_dbg(dev, "region lacks devmap flags\n");
 		return VM_FAULT_SIGBUS;
 	}
@@ -190,7 +200,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	}
 
 	/* dax pud mappings require pfn_t_devmap() */
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
+	if (!dax_is_pfn_map(dev_dax)) {
 		dev_dbg(dev, "region lacks devmap flags\n");
 		return VM_FAULT_SIGBUS;
 	}
-- 
2.17.1



* [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (5 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE Joao Martins
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Right now we assume the region is going to be PFN_DEV|PFN_MAP, which
means it will have struct pages backing the PFNs, just not placed in
the normal system RAM zones.

Add support for PFN_DEV|PFN_SPECIAL only, where the underlying vma
won't have struct pages. For device-dax this means not assuming
callers will pass a dev_pagemap, not returning SIGBUS for the lack of
the PFN_MAP region pfn flag, and finally not setting struct page
index/mapping on fault.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/bus.c    |  3 ++-
 drivers/dax/device.c | 40 ++++++++++++++++++++++------------------
 2 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 46e46047a1f7..96ca3ac85278 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -414,7 +414,8 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
 
-	memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap));
+	if (pgmap)
+		memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap));
 
 	/*
 	 * No 'host' or dax_operations since there is no access to this
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 113a554de3ee..aa38f5ff180a 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -14,6 +14,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+static int dax_is_pfn_special(struct dev_dax *dev_dax)
+{
+	return (dev_dax->region->pfn_flags &
+		(PFN_DEV|PFN_SPECIAL)) == (PFN_DEV|PFN_SPECIAL);
+}
+
 static int dax_is_pfn_dev(struct dev_dax *dev_dax)
 {
 	return (dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV;
@@ -104,6 +110,7 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	unsigned int fault_size = PAGE_SIZE;
+	int rc;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
 		return VM_FAULT_SIGBUS;
@@ -126,7 +133,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
+	if (dax_is_pfn_special(dev_dax))
+		rc = vmf_insert_pfn(vmf->vma, vmf->address, pfn_t_to_pfn(*pfn));
+	else
+		rc = vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
+
+	return rc;
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -149,12 +161,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!dax_is_pfn_map(dev_dax)) {
-		dev_dbg(dev, "region lacks devmap flags\n");
-		return VM_FAULT_SIGBUS;
-	}
-
 	if (fault_size < dax_region->align)
 		return VM_FAULT_SIGBUS;
 	else if (fault_size > dax_region->align)
@@ -199,12 +205,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	/* dax pud mappings require pfn_t_devmap() */
-	if (!dax_is_pfn_map(dev_dax)) {
-		dev_dbg(dev, "region lacks devmap flags\n");
-		return VM_FAULT_SIGBUS;
-	}
-
 	if (fault_size < dax_region->align)
 		return VM_FAULT_SIGBUS;
 	else if (fault_size > dax_region->align)
@@ -266,7 +266,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		rc = VM_FAULT_SIGBUS;
 	}
 
-	if (rc == VM_FAULT_NOPAGE) {
+	if (dax_is_pfn_map(dev_dax) && (rc == VM_FAULT_NOPAGE)) {
 		unsigned long i;
 		pgoff_t pgoff;
 
@@ -344,6 +344,8 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	vma->vm_ops = &dax_vm_ops;
 	vma->vm_flags |= VM_HUGEPAGE;
+	if (dax_is_pfn_special(dev_dax))
+		vma->vm_flags |= VM_PFNMAP;
 	return 0;
 }
 
@@ -450,10 +452,12 @@ int dev_dax_probe(struct device *dev)
 		return -EBUSY;
 	}
 
-	dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
-	addr = devm_memremap_pages(dev, &dev_dax->pgmap);
-	if (IS_ERR(addr))
-		return PTR_ERR(addr);
+	if (dax_is_pfn_map(dev_dax)) {
+		dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
+		addr = devm_memremap_pages(dev, &dev_dax->pgmap);
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
+	}
 
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
-- 
2.17.1



* [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (6 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_PFNMAP VMAs Joao Martins
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Allowing the dax pmem driver to work without struct pages means
that the user requests that no PFN metadata be created, by writing
PFN_MODE_NONE to the seed device's mode.

When the underlying nd_pfn->mode is PFN_MODE_NONE, most dax_pmem
initialization steps can be skipped because we won't have/need a
pfn superblock for the pagemap/struct-pages. We only allocate
an opaque zeroed object with the requested align, and
finally add PFN_SPECIAL to the region pfn_flags.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/pmem/core.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index 2bedf8414fff..67f5604a8291 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -17,15 +17,38 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
 	struct nd_namespace_io *nsio;
 	struct dax_region *dax_region;
 	struct dev_pagemap pgmap = { };
+	struct dev_pagemap *devmap = NULL;
 	struct nd_namespace_common *ndns;
 	struct nd_dax *nd_dax = to_nd_dax(dev);
 	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
 	struct nd_region *nd_region = to_nd_region(dev->parent);
+	unsigned long long pfn_flags = PFN_DEV;
 
 	ndns = nvdimm_namespace_common_probe(dev);
 	if (IS_ERR(ndns))
 		return ERR_CAST(ndns);
 
+	rc = sscanf(dev_name(&ndns->dev), "namespace%d.%d", &region_id, &id);
+	if (rc != 2)
+		return ERR_PTR(-EINVAL);
+
+	if (is_nd_dax(&nd_pfn->dev) && nd_pfn->mode == PFN_MODE_NONE) {
+		/* allocate a dummy super block */
+		pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb),
+				      GFP_KERNEL);
+		if (!pfn_sb)
+			return ERR_PTR(-ENOMEM);
+
+		memset(pfn_sb, 0, sizeof(*pfn_sb));
+		pfn_sb->align = nd_pfn->align;
+		nd_pfn->pfn_sb = pfn_sb;
+		pfn_flags |= PFN_SPECIAL;
+
+		nsio = to_nd_namespace_io(&ndns->dev);
+		memcpy(&res, &nsio->res, sizeof(res));
+		goto no_pfn_sb;
+	}
+
 	/* parse the 'pfn' info block via ->rw_bytes */
 	rc = devm_namespace_enable(dev, ndns, nd_info_block_reserve());
 	if (rc)
@@ -45,20 +68,21 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
 		return ERR_PTR(-EBUSY);
 	}
 
-	rc = sscanf(dev_name(&ndns->dev), "namespace%d.%d", &region_id, &id);
-	if (rc != 2)
-		return ERR_PTR(-EINVAL);
-
 	/* adjust the dax_region resource to the start of data */
 	memcpy(&res, &pgmap.res, sizeof(res));
 	res.start += offset;
+	devmap = &pgmap;
+	pfn_flags |= PFN_MAP;
+
+no_pfn_sb:
 	dax_region = alloc_dax_region(dev, region_id, &res,
 			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			PFN_DEV|PFN_MAP);
+			pfn_flags);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
-	dev_dax = __devm_create_dev_dax(dax_region, id, &pgmap, subsys);
+
+	dev_dax = __devm_create_dev_dax(dax_region, id, devmap, subsys);
 
 	/* child dev_dax instances now own the lifetime of the dax_region */
 	dax_region_put(dax_region);
-- 
2.17.1



* [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_PFNMAP VMAs
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (7 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-02-07 21:08   ` Jason Gunthorpe
  2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins
  2020-02-04  1:24 ` [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Dan Williams
  10 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

From: Nikita Leshenko <nikita.leshchenko@oracle.com>

Unconditionally interpreting vm_pgoff as a PFN is incorrect.

VMAs created by /dev/mem do this, but in general VM_PFNMAP just means
that the VMA doesn't have an associated struct page and is being managed
directly by something other than the core mmu.

Use follow_pfn like KVM does to find the PFN.

Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..1e43581f95ea 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -362,9 +362,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-		if (is_invalid_reserved_pfn(*pfn))
-			ret = 0;
+		ret = follow_pfn(vma, vaddr, pfn);
+		if (!ret && !is_invalid_reserved_pfn(*pfn))
+			ret = -EOPNOTSUPP;
 	}
 
 	up_read(&mm->mmap_sem);
-- 
2.17.1



* [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (8 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_PFNMAP VMAs Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-02-04 15:28   ` Barret Rhoden
  2020-02-04  1:24 ` [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Dan Williams
  10 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Users can define regions with 'memmap=size!offset', which in turn
creates legacy PMEM devices. But because it is a label-less
NVDIMM device, we only get one namespace for the whole device.

Add support for multiple namespaces by adding ndctl control
support, exposing a minimal set of features
(ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING, because we can
store labels.

Initialization is a little different: we allocate and register an
nvdimm bus with an @nvdimm_descriptor, which we use to locate
where we are keeping our label storage area. The config data
get/set/size operations then simply memcpy to/from this area.

An equivalent approach can be found in the NFIT tests, which
emulate the same thing.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/nvdimm/e820.c | 212 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 191 insertions(+), 21 deletions(-)

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index e02f60ad6c99..36fbff3d7110 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -7,14 +7,21 @@
 #include <linux/memory_hotplug.h>
 #include <linux/libnvdimm.h>
 #include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/ndctl.h>
+#include <linux/nd.h>
 
-static int e820_pmem_remove(struct platform_device *pdev)
-{
-	struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
+#define LABEL_SIZE SZ_128K
 
-	nvdimm_bus_unregister(nvdimm_bus);
-	return 0;
-}
+struct e820_descriptor {
+	struct nd_interleave_set nd_set;
+	struct nvdimm_bus_descriptor nd_desc;
+	void *label;
+	unsigned char cookie1[16];
+	unsigned char cookie2[16];
+	struct nvdimm_bus *nvdimm_bus;
+	struct nvdimm *nvdimm;
+};
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 static int e820_range_to_nid(resource_size_t addr)
@@ -28,43 +35,206 @@ static int e820_range_to_nid(resource_size_t addr)
 }
 #endif
 
+static int e820_get_config_size(struct nd_cmd_get_config_size *nd_cmd,
+				unsigned int buf_len)
+{
+	if (buf_len < sizeof(*nd_cmd))
+		return -EINVAL;
+
+	nd_cmd->status = 0;
+	nd_cmd->config_size = LABEL_SIZE;
+	nd_cmd->max_xfer = SZ_4K;
+
+	return 0;
+}
+
+static int e820_get_config_data(struct nd_cmd_get_config_data_hdr
+		*nd_cmd, unsigned int buf_len, void *label)
+{
+	unsigned int len, offset = nd_cmd->in_offset;
+	int rc;
+
+	if (buf_len < sizeof(*nd_cmd))
+		return -EINVAL;
+	if (offset >= LABEL_SIZE)
+		return -EINVAL;
+	if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+		return -EINVAL;
+
+	nd_cmd->status = 0;
+	len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+	memcpy(nd_cmd->out_buf, label + offset, len);
+	rc = buf_len - sizeof(*nd_cmd) - len;
+
+	return rc;
+}
+
+static int e820_set_config_data(struct nd_cmd_set_config_hdr *nd_cmd,
+		unsigned int buf_len, void *label)
+{
+	unsigned int len, offset = nd_cmd->in_offset;
+	u32 *status;
+	int rc;
+
+	if (buf_len < sizeof(*nd_cmd))
+		return -EINVAL;
+	if (offset >= LABEL_SIZE)
+		return -EINVAL;
+	if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+		return -EINVAL;
+
+	status = (void *)nd_cmd + nd_cmd->in_length + sizeof(*nd_cmd);
+	*status = 0;
+	len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+	memcpy(label + offset, nd_cmd->in_buf, len);
+	rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+
+	return rc;
+}
+
+static struct e820_descriptor *to_e820_desc(struct nvdimm_bus_descriptor *desc)
+{
+	return container_of(desc, struct e820_descriptor, nd_desc);
+}
+
+static int e820_ndctl(struct nvdimm_bus_descriptor *nd_desc,
+			 struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+			 unsigned int buf_len, int *cmd_rc)
+{
+	struct e820_descriptor *t = to_e820_desc(nd_desc);
+	int rc = -EINVAL;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE:
+		rc = e820_get_config_size(buf, buf_len);
+		break;
+	case ND_CMD_GET_CONFIG_DATA:
+		rc = e820_get_config_data(buf, buf_len, t->label);
+		break;
+	case ND_CMD_SET_CONFIG_DATA:
+		rc = e820_set_config_data(buf, buf_len, t->label);
+		break;
+	default:
+		return rc;
+	}
+
+	return rc;
+}
+
+static void e820_desc_free(struct e820_descriptor *desc)
+{
+	if (!desc)
+		return;
+
+	nvdimm_bus_unregister(desc->nvdimm_bus);
+	kfree(desc->label);
+	kfree(desc);
+}
+
+static struct e820_descriptor *e820_desc_alloc(struct platform_device *pdev)
+{
+	struct nvdimm_bus_descriptor *nd_desc;
+	unsigned int cmd_mask, dimm_flags;
+	struct device *dev = &pdev->dev;
+	struct nvdimm_bus *nvdimm_bus;
+	struct e820_descriptor *desc;
+	struct nvdimm *nvdimm;
+
+	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+	if (!desc)
+		goto err;
+
+	desc->label = kzalloc(LABEL_SIZE, GFP_KERNEL);
+	if (!desc->label)
+		goto err;
+
+	nd_desc = &desc->nd_desc;
+	nd_desc->provider_name = "e820";
+	nd_desc->module = THIS_MODULE;
+	nd_desc->ndctl = e820_ndctl;
+	nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
+	if (!nvdimm_bus) {
+		dev_err(dev, "nvdimm bus registration failure\n");
+		goto err;
+	}
+	desc->nvdimm_bus = nvdimm_bus;
+
+	cmd_mask = (1UL << ND_CMD_GET_CONFIG_SIZE |
+			1UL << ND_CMD_GET_CONFIG_DATA |
+			1UL << ND_CMD_SET_CONFIG_DATA);
+	dimm_flags = (1UL << NDD_ALIASING);
+	nvdimm = nvdimm_create(nvdimm_bus, pdev, NULL,
+				dimm_flags, cmd_mask, 0, NULL);
+	if (!nvdimm) {
+		dev_err(dev, "nvdimm creation failure\n");
+		goto err;
+	}
+	desc->nvdimm = nvdimm;
+	return desc;
+
+err:
+	e820_desc_free(desc);
+	return NULL;
+}
+
 static int e820_register_one(struct resource *res, void *data)
 {
+	struct platform_device *pdev = data;
 	struct nd_region_desc ndr_desc;
-	struct nvdimm_bus *nvdimm_bus = data;
+	struct nd_mapping_desc mapping;
+	struct e820_descriptor *desc;
+
+	desc = e820_desc_alloc(pdev);
+	if (!desc)
+		return -ENOMEM;
+
+	mapping.nvdimm = desc->nvdimm;
+	mapping.start = res->start;
+	mapping.size = resource_size(res);
+	mapping.position = 0;
+
+	generate_random_uuid(desc->cookie1);
+	desc->nd_set.cookie1 = (u64) desc->cookie1;
+	generate_random_uuid(desc->cookie2);
+	desc->nd_set.cookie2 = (u64) desc->cookie2;
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	ndr_desc.res = res;
 	ndr_desc.numa_node = e820_range_to_nid(res->start);
 	ndr_desc.target_node = ndr_desc.numa_node;
+	ndr_desc.mapping = &mapping;
+	ndr_desc.num_mappings = 1;
+	ndr_desc.nd_set = &desc->nd_set;
 	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
-	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+	if (!nvdimm_pmem_region_create(desc->nvdimm_bus, &ndr_desc)) {
+		e820_desc_free(desc);
+		dev_err(&pdev->dev, "nvdimm region creation failure\n");
 		return -ENXIO;
+	}
+
+	platform_set_drvdata(pdev, desc);
+	return 0;
+}
+
+static int e820_pmem_remove(struct platform_device *pdev)
+{
+	struct e820_descriptor *desc = platform_get_drvdata(pdev);
+
+	e820_desc_free(desc);
 	return 0;
 }
 
 static int e820_pmem_probe(struct platform_device *pdev)
 {
-	static struct nvdimm_bus_descriptor nd_desc;
-	struct device *dev = &pdev->dev;
-	struct nvdimm_bus *nvdimm_bus;
 	int rc = -ENXIO;
 
-	nd_desc.provider_name = "e820";
-	nd_desc.module = THIS_MODULE;
-	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
-	if (!nvdimm_bus)
-		goto err;
-	platform_set_drvdata(pdev, nvdimm_bus);
-
 	rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
-			IORESOURCE_MEM, 0, -1, nvdimm_bus, e820_register_one);
+			IORESOURCE_MEM, 0, -1, pdev, e820_register_one);
 	if (rc)
 		goto err;
 	return 0;
 err:
-	nvdimm_bus_unregister(nvdimm_bus);
-	dev_err(dev, "failed to register legacy persistent memory ranges\n");
+	dev_err(&pdev->dev, "failed to register legacy persistent memory ranges\n");
 	return rc;
 }
 
-- 
2.17.1



* Re: [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
@ 2020-02-03 21:34   ` Matthew Wilcox
  2020-02-04 16:14     ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2020-02-03 21:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Konrad Rzeszutek Wilk

On Fri, Jan 10, 2020 at 07:03:04PM +0000, Joao Martins wrote:
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
>  {
>  	return 0;
>  }
> +#endif
> +
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +static inline int pmd_special(pmd_t pmd)
> +{
> +	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
> +}
> +#endif

The ifdef/endif don't make much sense here; x86 does have PTE_SPECIAL,
and this is an x86 header file, so that can be assumed.

> +++ b/mm/gup.c
> @@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
>  	}
>  
> +	if (pmd_special(orig))
> +		return 0;

Here, you're calling it unconditionally.  I think you need a pmd_special()
conditionally defined in include/asm-generic/pgtable.h

+#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline bool pmd_special(pmd_t pmd)
+{
+	return false;
+}
+#endif

(oh, and please use bool instead of int; I know that's different from
pte_special(), but pte_special() predates bool and nobody's done the work
to convert it yet)

> +++ b/mm/huge_memory.c
> @@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>  	if (pfn_t_devmap(pfn))
>  		entry = pmd_mkdevmap(entry);
> +	else if (pfn_t_special(pfn))
> +		entry = pmd_mkspecial(entry);

Again, we'll need a generic one.
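
Perhaps along these lines (just a sketch), mirroring how pte_mkspecial()
degrades to a no-op on architectures without a 'special' bit:

+#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd;
+}
+#endif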

> @@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	 * but we need to be consistent with PTEs and architectures that
>  	 * can't support a 'special' bit.
>  	 */
> -	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
> -			!pfn_t_devmap(pfn));
> +	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));

Should that rather be ...

+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
+			!pfn_t_devmap(pfn) && !pfn_t_special(pfn));

I also think this comment needs adjusting:

        /*
         * There is no pmd_special() but there may be special pmds, e.g.
         * in a direct-access (dax) mapping, so let's just replicate the
         * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
         */
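
Perhaps something like this (just a sketch):

        /*
         * There may be special pmds, e.g. in a direct-access (dax)
         * mapping, so replicate the !CONFIG_ARCH_HAS_PTE_SPECIAL case
         * from vm_normal_page() here on architectures that lack a pmd
         * 'special' bit.
         */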




* Re: [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn()
  2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
@ 2020-02-03 21:37   ` Matthew Wilcox
  2020-02-04 16:17     ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2020-02-03 21:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Konrad Rzeszutek Wilk

On Fri, Jan 10, 2020 at 07:03:05PM +0000, Joao Martins wrote:
> @@ -4366,6 +4366,7 @@ EXPORT_SYMBOL(follow_pte_pmd);
>  int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>  	unsigned long *pfn)
>  {
> +	pmd_t *pmdpp = NULL;

Please rename to 'pmdp'.



* Re: [PATCH RFC 00/10] device-dax: Support devices without PFN metadata
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (9 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins
@ 2020-02-04  1:24 ` Dan Williams
  2020-02-04 19:07   ` Joao Martins
  10 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2020-02-04  1:24 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Barret Rhoden, Boris Ostrovsky,
	Matthew Wilcox, Konrad Rzeszutek Wilk

On Fri, Jan 10, 2020 at 11:06 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,

Hi Joao,

>
> Presented herewith is a small series which allows device-dax to back KVM
> guest memory without struct page. It's an RFC, and there are
> still some items we're looking at (see TODO below);

So it turns out I already have some patches in flight that address the
discontiguous allocation item. Here's a WIP branch that I'll be
sending out after the merge window closes.

https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

> but we're wondering if folks
> would be OK carving some time out of their busy schedules to provide feedback
> direction-wise on this work.

...apologies I did not get to this sooner. Please feel free to ping me
after a week if you're awaiting comment on anything in the nvdimm or
device-dax area.

> In virtualized environments (especially those with no kernel-backed PV
> interfaces, and just SR-IOV), memory is largely assigned to guests: either
> persistent with NVDIMMs or volatile for regular RAM. The kernel
> (hypervisor) tracks it with a 'struct page' (64 bytes) for each 4K page.
> Overall we're spending 16GB for each 1TB of host memory tracked that the
> kernel won't need, and which could instead be used to create other guests.
> One of the motivations of this series is to get back the memory spent on
> 'struct page' when it is meant to be used solely by userspace.

Do you mean reclaim it for the guest to use for 'struct page' capacity
since the host hypervisor has reduced need for it?

> This is also useful for the case of memory
> backing guests virtual NVDIMMs. The other neat side effect is that the
> hypervisor has no virtual mapping of the guest and hence code gadgets (if
> found) are limited in their effectiveness.

You mean just the direct-map? qemu still gets a valid virtual address,
or are you proposing it's not mapped there either?

> It is expected that a smaller (instead of total) amount of host memory is
> defined for the kernel (with mem=X or memmap=X!Y). For KVM userspace VMM (e.g.
> QEMU), the main thing that is needed is a device which open + mmap + close with
> a certain alignment (4K, 2M, 1G). That made us look at device-dax which does
> just that and so the work comprised here was improving what's there and the
> interfaces it uses.

In general I think this "'struct page'-less device-dax" capability
makes sense; why suffer a 1.6% capacity loss when that memory is going
unused? My main concerns are:

1/ Use of memmap=nn!ss when none of the persistent memory
infrastructure is needed.

2/ Auditing the new usage of VM_PFNMAP and what that means for
memory-error handling and applications that expect 'struct page'
services to be present.

3/ "page-less" dreams have been dashed on the rocks in the past. The
slow drip of missing features (ptrace(), direct-I/O, fork()...) is why
the kernel now requires them by default for dax.

For 1/ I have a proposal; for 2/ I need to dig into what you have
here, but maybe I can trade you some review on the discontiguous
allocation patches. For 3/ can you say a bit more about why the
overhead is intolerable?

The proposal for 1/ is: please let's not build more on top of
memmap=nn!ss. It's fragile because there's no facility to validate
that it is correct; the assumption is that the user overriding the
memmap knows to pick addresses that don't collide and have real memory
backing. Practice has shown that people get it wrong often enough that
we need something better. It's also confusing that applications start
seeing "Persistent Memory" in /proc/iomem when it's only there to
shunt memory over to device-dax.

The alternative is "efi_fake_mem=nn@ss:0x40000" and the related
'soft-reservation' enabling that bypasses the persistent memory
enabling and assigns memory to device-dax directly. EFI attribute
override is safer because this attribute is only honored when applied
to existing EFI conventional memory.
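
For illustration, the difference on the kernel command line would look
something like this (sizes and addresses are hypothetical):

	memmap=32G!16G                 # fake e820 "Persistent Memory" at 16G
	efi_fake_mem=32G@16G:0x40000   # mark existing EFI conventional memory
	                               # as 'soft reserved' for device-dax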

Have a look at that branch (just be warned it's not polished), or just
wait for me to send the patches out with a proper cover letter, and I'll
take a look at what you have below.

>
> The series is divided as follows:
>
>  * Patch 1 , 3: Preparatory work for patch 7 for adding support for
>                vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL
>
>  * Patch 2 , 4: Preparatory work for patch 7 for adding support for
>                follow_pfn() to work with 2M/1G huge pages, which is
>                what KVM uses for VM_PFNMAP.
>
>  * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
>                which encompasses mainly dealing with the lack of devmap,
>                and creating a VM_PFNMAP vma.
>
>  * Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
>            At the very end of the cover letter (after scissors mark),
>            there's a patch for ndctl to be able to create namespaces
>            with '--mode devdax --map none'.
>
>  * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
>             a PFN.
>
>  * Patch 10: The actual end consumer example for RAM case. The patch just adds a
>              label storage area which consequently allows namespaces to be
>              created. We picked PMEM legacy for starters.
>
> Thoughts, coments appreciated.
>         Joao
>
> P.S. As an example to try this out:
>
>  1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
>  and kernel has 16G.
>
>  2) create a devdax namespace with 1G hugepages:
>
>  $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
>  {
>   "dev":"namespace0.0",
>   "mode":"devdax",
>   "map":"none",
>   "size":"32.00 GiB (34.36 GB)",
>   "uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
>   "daxregion":{
>     "id":0,
>     "size":"32.00 GiB (34.36 GB)",
>     "align":1073741824,
>     "devices":[
>       {
>         "chardev":"dax0.0",
>         "size":"32.00 GiB (34.36 GB)",
>         "target_node":0,
>         "mode":"devdax"
>       }
>     ]
>   },
>   "align":1073741824
>  }
>
>  3) Add this to your qemu params:
>   -m 32G
>   -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
>   -numa node,memdev=mem
>
> TODO:
>
>  * Discontiguous regions/namespaces: The work above is limited to max
> contiguous extent, coming from nvdimm dpa allocation heuristics -- which I take
> is because of what specs allow for persistent namespaces. But for volatile RAM
> case we would need handling of discontiguous extents (hence a region would represent
> more than a resource) to be less bound to how guests are placed on the system.
> I played around with multi-resource for device-dax, but I'm wondering about
> UABI: 1) whether nvdimm DPA allocation heuristics should be relaxed for RAM
> case (under certain nvdimm region bits); or if 2) device-dax would have its
> own separate UABI to be used by daxctl (which would be also useful for hmem
> devices?).
>
>  * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
> device-dax, which would allow collect_procs() to find the processes solely based
> on the PFN. But for discontiguous namespaces, not sure if this would work; perhaps
> looking at the dax-region pfn range for each DAX vma.

You mean, make the memory error handling device-dax aware?

>
>  * NUMA: For now excluded setting the target_node; while these two patches
>  are being worked on[1][2].
>
>  [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
>  [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/

I'll ping x86 folks again after the merge window. I expect they have
just not had time to ack them yet.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins
@ 2020-02-04 15:28   ` Barret Rhoden
  2020-02-04 16:44     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Barret Rhoden @ 2020-02-04 15:28 UTC (permalink / raw)
  To: Joao Martins, linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Hi -

On 1/10/20 2:03 PM, Joao Martins wrote:
> User can define regions with 'memmap=size!offset' which in turn
> creates PMEM legacy devices. But because it is a label-less
> NVDIMM device we only have one namespace for the whole device.
> 
> Add support for multiple namespaces by adding ndctl control
> support, and exposing a minimal set of features:
> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
> store labels.

FWIW, I like this a lot.  If we move away from using memmap in favor of 
efi_fake_mem, ideally we'd have the same support for full-fledged 
pmem/dax regions and namespaces that this patch brings.

Thanks,
Barret


> 
> Initialization is a little different: We allocate and register an
> nvdimm bus with an @nvdimm_descriptor which we use to locate
> where we are keeping our label storage area. The config data
> get/set/size operations are then simply memcpying to this area.
> 
> Equivalent approach can also be found in the NFIT tests which
> emulate the same thing.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/nvdimm/e820.c | 212 +++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 191 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index e02f60ad6c99..36fbff3d7110 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -7,14 +7,21 @@
>   #include <linux/memory_hotplug.h>
>   #include <linux/libnvdimm.h>
>   #include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/ndctl.h>
> +#include <linux/nd.h>
>   
> -static int e820_pmem_remove(struct platform_device *pdev)
> -{
> -	struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
> +#define LABEL_SIZE SZ_128K
>   
> -	nvdimm_bus_unregister(nvdimm_bus);
> -	return 0;
> -}
> +struct e820_descriptor {
> +	struct nd_interleave_set nd_set;
> +	struct nvdimm_bus_descriptor nd_desc;
> +	void *label;
> +	unsigned char cookie1[16];
> +	unsigned char cookie2[16];
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nvdimm *nvdimm;
> +};
>   
>   #ifdef CONFIG_MEMORY_HOTPLUG
>   static int e820_range_to_nid(resource_size_t addr)
> @@ -28,43 +35,206 @@ static int e820_range_to_nid(resource_size_t addr)
>   }
>   #endif
>   
> +static int e820_get_config_size(struct nd_cmd_get_config_size *nd_cmd,
> +				unsigned int buf_len)
> +{
> +	if (buf_len < sizeof(*nd_cmd))
> +		return -EINVAL;
> +
> +	nd_cmd->status = 0;
> +	nd_cmd->config_size = LABEL_SIZE;
> +	nd_cmd->max_xfer = SZ_4K;
> +
> +	return 0;
> +}
> +
> +static int e820_get_config_data(struct nd_cmd_get_config_data_hdr
> +		*nd_cmd, unsigned int buf_len, void *label)
> +{
> +	unsigned int len, offset = nd_cmd->in_offset;
> +	int rc;
> +
> +	if (buf_len < sizeof(*nd_cmd))
> +		return -EINVAL;
> +	if (offset >= LABEL_SIZE)
> +		return -EINVAL;
> +	if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
> +		return -EINVAL;
> +
> +	nd_cmd->status = 0;
> +	len = min(nd_cmd->in_length, LABEL_SIZE - offset);
> +	memcpy(nd_cmd->out_buf, label + offset, len);
> +	rc = buf_len - sizeof(*nd_cmd) - len;
> +
> +	return rc;
> +}
> +
> +static int e820_set_config_data(struct nd_cmd_set_config_hdr *nd_cmd,
> +		unsigned int buf_len, void *label)
> +{
> +	unsigned int len, offset = nd_cmd->in_offset;
> +	u32 *status;
> +	int rc;
> +
> +	if (buf_len < sizeof(*nd_cmd))
> +		return -EINVAL;
> +	if (offset >= LABEL_SIZE)
> +		return -EINVAL;
> +	if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
> +		return -EINVAL;
> +
> +	status = (void *)nd_cmd + nd_cmd->in_length + sizeof(*nd_cmd);
> +	*status = 0;
> +	len = min(nd_cmd->in_length, LABEL_SIZE - offset);
> +	memcpy(label + offset, nd_cmd->in_buf, len);
> +	rc = buf_len - sizeof(*nd_cmd) - (len + 4);
> +
> +	return rc;
> +}
> +
> +static struct e820_descriptor *to_e820_desc(struct nvdimm_bus_descriptor *desc)
> +{
> +	return container_of(desc, struct e820_descriptor, nd_desc);
> +}
> +
> +static int e820_ndctl(struct nvdimm_bus_descriptor *nd_desc,
> +			 struct nvdimm *nvdimm, unsigned int cmd, void *buf,
> +			 unsigned int buf_len, int *cmd_rc)
> +{
> +	struct e820_descriptor *t = to_e820_desc(nd_desc);
> +	int rc = -EINVAL;
> +
> +	switch (cmd) {
> +	case ND_CMD_GET_CONFIG_SIZE:
> +		rc = e820_get_config_size(buf, buf_len);
> +		break;
> +	case ND_CMD_GET_CONFIG_DATA:
> +		rc = e820_get_config_data(buf, buf_len, t->label);
> +		break;
> +	case ND_CMD_SET_CONFIG_DATA:
> +		rc = e820_set_config_data(buf, buf_len, t->label);
> +		break;
> +	default:
> +		return rc;
> +	}
> +
> +	return rc;
> +}
> +
> +static void e820_desc_free(struct e820_descriptor *desc)
> +{
> +	if (!desc)
> +		return;
> +
> +	nvdimm_bus_unregister(desc->nvdimm_bus);
> +	kfree(desc->label);
> +	kfree(desc);
> +}
> +
> +static struct e820_descriptor *e820_desc_alloc(struct platform_device *pdev)
> +{
> +	struct nvdimm_bus_descriptor *nd_desc;
> +	unsigned int cmd_mask, dimm_flags;
> +	struct device *dev = &pdev->dev;
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct e820_descriptor *desc;
> +	struct nvdimm *nvdimm;
> +
> +	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> +	if (!desc)
> +		goto err;
> +
> +	desc->label = kzalloc(LABEL_SIZE, GFP_KERNEL);
> +	if (!desc->label)
> +		goto err;
> +
> +	nd_desc = &desc->nd_desc;
> +	nd_desc->provider_name = "e820";
> +	nd_desc->module = THIS_MODULE;
> +	nd_desc->ndctl = e820_ndctl;
> +	nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
> +	if (!nvdimm_bus) {
> +		dev_err(dev, "nvdimm bus registration failure\n");
> +		goto err;
> +	}
> +	desc->nvdimm_bus = nvdimm_bus;
> +
> +	cmd_mask = (1UL << ND_CMD_GET_CONFIG_SIZE |
> +			1UL << ND_CMD_GET_CONFIG_DATA |
> +			1UL << ND_CMD_SET_CONFIG_DATA);
> +	dimm_flags = (1UL << NDD_ALIASING);
> +	nvdimm = nvdimm_create(nvdimm_bus, pdev, NULL,
> +				dimm_flags, cmd_mask, 0, NULL);
> +	if (!nvdimm) {
> +		dev_err(dev, "nvdimm creation failure\n");
> +		goto err;
> +	}
> +	desc->nvdimm = nvdimm;
> +	return desc;
> +
> +err:
> +	e820_desc_free(desc);
> +	return NULL;
> +}
> +
>   static int e820_register_one(struct resource *res, void *data)
>   {
> +	struct platform_device *pdev = data;
>   	struct nd_region_desc ndr_desc;
> -	struct nvdimm_bus *nvdimm_bus = data;
> +	struct nd_mapping_desc mapping;
> +	struct e820_descriptor *desc;
> +
> +	desc = e820_desc_alloc(pdev);
> +	if (!desc)
> +		return -ENOMEM;
> +
> +	mapping.nvdimm = desc->nvdimm;
> +	mapping.start = res->start;
> +	mapping.size = resource_size(res);
> +	mapping.position = 0;
> +
> +	generate_random_uuid(desc->cookie1);
> +	desc->nd_set.cookie1 = (u64) desc->cookie1;
> +	generate_random_uuid(desc->cookie2);
> +	desc->nd_set.cookie2 = (u64) desc->cookie2;
>   
>   	memset(&ndr_desc, 0, sizeof(ndr_desc));
>   	ndr_desc.res = res;
>   	ndr_desc.numa_node = e820_range_to_nid(res->start);
>   	ndr_desc.target_node = ndr_desc.numa_node;
> +	ndr_desc.mapping = &mapping;
> +	ndr_desc.num_mappings = 1;
> +	ndr_desc.nd_set = &desc->nd_set;
>   	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> -	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
> +	if (!nvdimm_pmem_region_create(desc->nvdimm_bus, &ndr_desc)) {
> +		e820_desc_free(desc);
> +		dev_err(&pdev->dev, "nvdimm region creation failure\n");
>   		return -ENXIO;
> +	}
> +
> +	platform_set_drvdata(pdev, desc);
> +	return 0;
> +}
> +
> +static int e820_pmem_remove(struct platform_device *pdev)
> +{
> +	struct e820_descriptor *desc = platform_get_drvdata(pdev);
> +
> +	e820_desc_free(desc);
>   	return 0;
>   }
>   
>   static int e820_pmem_probe(struct platform_device *pdev)
>   {
> -	static struct nvdimm_bus_descriptor nd_desc;
> -	struct device *dev = &pdev->dev;
> -	struct nvdimm_bus *nvdimm_bus;
>   	int rc = -ENXIO;
>   
> -	nd_desc.provider_name = "e820";
> -	nd_desc.module = THIS_MODULE;
> -	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
> -	if (!nvdimm_bus)
> -		goto err;
> -	platform_set_drvdata(pdev, nvdimm_bus);
> -
>   	rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
> -			IORESOURCE_MEM, 0, -1, nvdimm_bus, e820_register_one);
> +			IORESOURCE_MEM, 0, -1, pdev, e820_register_one);
>   	if (rc)
>   		goto err;
>   	return 0;
>   err:
> -	nvdimm_bus_unregister(nvdimm_bus);
> -	dev_err(dev, "failed to register legacy persistent memory ranges\n");
> +	dev_err(&pdev->dev, "failed to register legacy persistent memory ranges\n");
>   	return rc;
>   }
>   
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL
  2020-02-03 21:34   ` Matthew Wilcox
@ 2020-02-04 16:14     ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-02-04 16:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Konrad Rzeszutek Wilk

On 2/3/20 9:34 PM, Matthew Wilcox wrote:
> On Fri, Jan 10, 2020 at 07:03:04PM +0000, Joao Martins wrote:
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
>>  {
>>  	return 0;
>>  }
>> +#endif
>> +
>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> +static inline int pmd_special(pmd_t pmd)
>> +{
>> +	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
>> +}
>> +#endif
> 
> The ifdef/endif don't make much sense here; x86 does have PTE_SPECIAL,
> and this is an x86 header file, so that can be assumed.
> 
Gotcha.

>> +++ b/mm/gup.c
>> @@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>>  		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
>>  	}
>>  
>> +	if (pmd_special(orig))
>> +		return 0;
> 
> Here, you're calling it unconditionally.  I think you need a pmd_special()
> conditionally defined in include/asm-generic/pgtable.h
> 
> +#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
> +static inline bool pmd_special(pmd_t pmd)
> +{
> +	return false;
> +}
> +#endif
> 
> (oh, and plese use bool instead of int; I know that's different from
> pte_special(), but pte_special() predates bool and nobody's done the work
> to convert it yet)
> 
Got it.

>> +++ b/mm/huge_memory.c
>> @@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>>  	if (pfn_t_devmap(pfn))
>>  		entry = pmd_mkdevmap(entry);
>> +	else if (pfn_t_special(pfn))
>> +		entry = pmd_mkspecial(entry);
> 
> Again, we'll need a generic one.
> 
Will add it.

>> @@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>  	 * but we need to be consistent with PTEs and architectures that
>>  	 * can't support a 'special' bit.
>>  	 */
>> -	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
>> -			!pfn_t_devmap(pfn));
>> +	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
> 
> Should that rather be ...
> 
> +	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
> +			!pfn_t_devmap(pfn) && !pfn_t_special(pfn));
> 
Yes. That is indeed a mistake I had already fixed for v2. Patch 3 does the
exact same, as do the other comments you mentioned here, so I will adjust
accordingly.

> I also think this comment needs adjusting:
> 
>         /*
>          * There is no pmd_special() but there may be special pmds, e.g.
>          * in a direct-access (dax) mapping, so let's just replicate the
>          * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
>          */
> 
> 
I'll replace it with what the vm_normal_page() equivalent has:

	/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */

  Joao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn()
  2020-02-03 21:37   ` Matthew Wilcox
@ 2020-02-04 16:17     ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-02-04 16:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Konrad Rzeszutek Wilk

On 2/3/20 9:37 PM, Matthew Wilcox wrote:
> On Fri, Jan 10, 2020 at 07:03:05PM +0000, Joao Martins wrote:
>> @@ -4366,6 +4366,7 @@ EXPORT_SYMBOL(follow_pte_pmd);
>>  int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>>  	unsigned long *pfn)
>>  {
>> +	pmd_t *pmdpp = NULL;
> 
> Please rename to 'pmdp'.
> 
Will do.

Alongside patch 4's usage of 'pmdpp', and renaming 'pudpp' to 'pudp' as well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-02-04 15:28   ` Barret Rhoden
@ 2020-02-04 16:44     ` Dan Williams
  2020-02-04 18:20       ` Barret Rhoden
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2020-02-04 16:44 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Joao Martins, linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Boris Ostrovsky, Matthew Wilcox,
	Konrad Rzeszutek Wilk

On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <brho@google.com> wrote:
>
> Hi -
>
> On 1/10/20 2:03 PM, Joao Martins wrote:
> > User can define regions with 'memmap=size!offset' which in turn
> > creates PMEM legacy devices. But because it is a label-less
> > NVDIMM device we only have one namespace for the whole device.
> >
> > Add support for multiple namespaces by adding ndctl control
> > support, and exposing a minimal set of features:
> > (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
> > ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
> > store labels.
>
> FWIW, I like this a lot.  If we move away from using memmap in favor of
> efi_fake_mem, ideally we'd have the same support for full-fledged
> pmem/dax regions and namespaces that this patch brings.

No, efi_fake_mem only supports creating dax-regions. What's the use
case that can't be satisfied by just specifying multiple memmap=
ranges?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-02-04 16:44     ` Dan Williams
@ 2020-02-04 18:20       ` Barret Rhoden
  2020-02-04 19:24         ` Joao Martins
  2020-02-04 21:43         ` Dan Williams
  0 siblings, 2 replies; 26+ messages in thread
From: Barret Rhoden @ 2020-02-04 18:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Boris Ostrovsky, Matthew Wilcox,
	Konrad Rzeszutek Wilk

Hi -

On 2/4/20 11:44 AM, Dan Williams wrote:
> On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <brho@google.com> wrote:
>>
>> Hi -
>>
>> On 1/10/20 2:03 PM, Joao Martins wrote:
>>> User can define regions with 'memmap=size!offset' which in turn
>>> creates PMEM legacy devices. But because it is a label-less
>>> NVDIMM device we only have one namespace for the whole device.
>>>
>>> Add support for multiple namespaces by adding ndctl control
>>> support, and exposing a minimal set of features:
>>> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
>>> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
>>> store labels.
>>
>> FWIW, I like this a lot.  If we move away from using memmap in favor of
>> efi_fake_mem, ideally we'd have the same support for full-fledged
>> pmem/dax regions and namespaces that this patch brings.
> 
> No, efi_fake_mem only supports creating dax-regions. What's the use
> case that can't be satisfied by just specifying multiple memmap=
> ranges?

I'd like to be able to create and destroy dax regions on the fly.  In 
particular, I want to run guest VMs using the dax files for guest 
memory, but I don't know at boot time how many VMs I'll have, or what 
their sizes are.  Ideally, I'd have separate files for each VM, instead 
of a single /dev/dax.

I currently do this with fs-dax with one big memmap region (ext4 on 
/dev/pmem0), and I use the file system to handle the 
creation/destruction/resizing and metadata management.  But since fs-dax 
won't work with device pass-through, I started looking at dev-dax, with 
the expectation that I'd need some software to manage the memory (i.e. 
allocation).  That led me to ndctl, which seems to need namespace labels 
to have the level of control I was looking for.
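
For illustration, the per-VM lifecycle I'm after would look roughly like
this (sizes hypothetical; '--map none' needs the ndctl patch from the
cover letter):

	ndctl create-namespace --mode devdax --map none --size 8G -r 0    # VM A
	ndctl create-namespace --mode devdax --map none --size 16G -r 0   # VM B
	ndctl destroy-namespace --force namespace0.1                      # VM A went away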

Thanks,

Barret


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/10] device-dax: Support devices without PFN metadata
  2020-02-04  1:24 ` [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Dan Williams
@ 2020-02-04 19:07   ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-02-04 19:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Barret Rhoden, Boris Ostrovsky,
	Matthew Wilcox, Konrad Rzeszutek Wilk

On 2/4/20 1:24 AM, Dan Williams wrote:
> On Fri, Jan 10, 2020 at 11:06 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Hey,
> 
> Hi Joao,
> 
>>
>> Presented herewith a small series which allows device-dax to work without
>> struct page to be used to back KVM guests memory. It's an RFC, and there's
>> still some items we're looking at (see TODO below);
> 
> So it turns out I already have some patches in flight that address
> discontiguous allocation item. Here's a WIP branch that I'll be
> sending out after the merge window closes.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending
> 
Neat!

>> but wondering if folks
>> would be OK carving some time out of their busy schedules to provide feedback
>> direction-wise on this work.
> 
> ...apologies I did not get to this sooner. Please feel free to ping me
> after a week if you're awaiting comment on anything in the nvdimm or
> device-dax area.
> 
OK, got it.

>> In virtualized environments (specially those with no kernel-backed PV
>> interfaces, and just SR-IOV), memory is largelly assigned to guests: either
>> persistent with NVDIMMs or volatile for regular RAM. The kernel
>> (hypervisor) tracks it with 'struct page' (64b) for each 4K page. Overall
>> we're spending 16GB for each 1Tb of host memory tracked that the kernel won't
>> need  which could instead be used to create other guests. One of motivations of
>> this series is to then get that memory used for 'struct page', when it is meant
>> to solely be used by userspace.
> 
> Do you mean reclaim it for the guest to use for 'struct page' capacity
> since the host hypervisor has reduced need for it?
> 
Yes.

>> This is also useful for the case of memory
>> backing guests virtual NVDIMMs. The other neat side effect is that the
>> hypervisor has no virtual mapping of the guest and hence code gadgets (if
>> found) are limited in their effectiveness.
> 
> You mean just the direct-map? 

Yes.

> qemu still gets a valid virtual address,
> or are you proposing it's not mapped there either?
> 
Correct, Qemu still has a valid virtual address.

>> It is expected that a smaller (instead of total) amount of host memory is
>> defined for the kernel (with mem=X or memmap=X!Y). For KVM userspace VMM (e.g.
>> QEMU), the main thing that is needed is a device which open + mmap + close with
>> a certain alignment (4K, 2M, 1G). That made us look at device-dax which does
>> just that and so the work comprised here was improving what's there and the
>> interfaces it uses.
> 
> In general I think this "'struct page'-less device-dax" capability
> makes sense, why suffer 1.6% capacity loss when that memory is going
> unused? My main concerns are:

Exactly!

> 
> 1/ Use of memmap=nn!ss when none of the persistent memory
> infrastructure is needed.
> 
> 2/ Auditing the new usage of VM_PFNMAP and what that means for
> memory-error handling and applications that expect 'struct page'
> services to be present.
> 
> 3/ "page-less" dreams have been dashed on the rocks in the past. The
> slow drip of missing features (ptrace(), direct-I/O, fork()...) is why
> the kernel now requires them by default for dax.
> 

wrt ptrace(), the only usage I could find was PEEKDATA/POKEDATA/etc
(essentially access_remote_vm()). If devdax defines an access() handler
which copies from/to the pfn to/from the user buffer, then it should work.
Unless there's something else that I am missing?
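
As a rough sketch of what I mean (the function name and the
memremap()-based mapping are illustrative assumptions, not necessarily
what a final patch would do):

	/*
	 * Hypothetical devdax vm_ops->access() so that access_remote_vm()
	 * (and hence ptrace PEEKDATA/POKEDATA) works on a pfn-special
	 * mapping without a struct page.
	 */
	static int dev_dax_access(struct vm_area_struct *vma, unsigned long addr,
				  void *buf, int len, int write)
	{
		unsigned long pfn, offset = addr & ~PAGE_MASK;
		void *kaddr;

		if (follow_pfn(vma, addr, &pfn))
			return 0;

		/* transiently map the physical page backing @addr */
		kaddr = memremap(PFN_PHYS(pfn), PAGE_SIZE, MEMREMAP_WB);
		if (!kaddr)
			return 0;

		len = min_t(int, len, PAGE_SIZE - offset);
		if (write)
			memcpy(kaddr + offset, buf, len);
		else
			memcpy(buf, kaddr + offset, len);
		memunmap(kaddr);

		return len;
	}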

> For 1/ I have a proposal, for 2/ I need to dig in to what you have
> here, but maybe I can trade you some review on the discontiguous
> allocation patches.

Will definitely look at those patches. Before this series I had actually
started with hmem (based on your initial multi-resource dax work) and had
patches for daxctl too (to create/destroy the devices backed by
soft-reserved memory), before switching to nd_e820. Let me know where I can
help; I can throw those away and build on top of the work on that branch.

For 1), part of the reason to start with memmap= was that the series was more
about devdax infra than the 'partitioning' itself. But it wasn't clear to me
which layer would be the best fit, and given that the devdax changes are
somewhat orthogonal, I figured I would settle on something simpler that
reuses a common facility and hear out opinions.

For 2) the idea is to keep usage of services that require 'struct page' to
the minimum (or none) necessary. For example, if you have a guest with only
VFIO instances you really don't need much (aside from MCE, and optionally
ptrace attach). For kernel PV, the one that would require work would be
vhost-scsi, which would need the equivalent of vhost_net.experimental_zcopytx=0
(the default since v5.3) and therefore copy as opposed to doing gup. I have
a patch for that.

But as you can see in these patches I am trying to keep the same properties
as existing pte-special-based mappings. Should we need a 'struct page', I
wonder if we could dynamically create/delete one (e.g. maybe via
find_special_page()).

> For 3/ can you say a bit more about why the
> overhead is intolerable?
> 

It really is about extra memory being 'wasted' when the hypervisor's purpose
is to serve those same resources (e.g. memory, IO) back to guests.

In terms of percentage it's somewhat reductive to frame the overhead as just
1.6% (or 16GB per 1TB). When we take into account that the memory required by
hypervisor applications should be kept small[*], you quickly realize that the
16GB 'lost' to 'struct page' is a lot. The hugepages (which you also get with
devdax with a devmap) help further, as you deterministically require less
memory for page table entries.

[*] as an arbitrary example: say only up to 18GB or less for hypervisor
applications/orchestration on a 1TB machine.
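
To make the arithmetic above explicit:

	1TB / 4KB pages               = 268,435,456 pages
	268,435,456 pages * 64 bytes  = 16GB of 'struct page' (~1.6%)

and with 2M mappings you also need 512x fewer last-level page table
entries to cover the same guest memory.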

This is also not specific to volatile RAM; I suppose we have the same problem
with persistent memory that is given to guests via e.g. QEMU-emulated NVDIMMs.

> The proposal for 1/ is please let's not build more on top of
> memmap=nn!ss. It's fragile because there's no facility to validate
> that it is correct, the assumption is that the user overriding the
> memmap knows to pick address that don't collide and have real memory
> backing. Practice has shown that people get it wrong often enough that
> we need something better. 

/nods Agreed.

> It's also confusing that applications start
> seeing "Persistent Memory" in /proc/iomem when it's only there to
> shunt memory over to device-dax.
> 

Right. The one side benefit of the existing nvdimm facilities, though, is
that we reuse namespace-based partitioning, and so this carving up of dax
ends up replicating what's already in drivers/nvdimm. But I am not arguing
one way or the other; at the moment I was already wondering whether I am
abusing nvdimm or whether this new layer should be created.

> The alternative is "efi_fake_mem=nn@ss:0x40000" and the related
> 'soft-reservation' enabling that bypasses the persistent memory
> enabling and assigns memory to device-dax directly. EFI attribute
> override is safer because this attribute is only honored when applied
> to existing EFI conventional memory.
> 
/nods

One idea I had was to make this slightly more dynamic, as opposed to a hard
limit; it would also be less tied to the EFI memory map.

Say we offline certain memory blocks and reassign them back to dax_hmem. In
a way it would be somewhat the reverse of dax_kmem (or rather of dax_kmem's
understanding of new_id with memory%d devices). Would that be sensible? I am
still sketching this, and not yet sure to what extent I can get it done
cleanly.
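
A very rough sketch of the flow I am imagining (the memory-block sysfs
interface is existing ABI; the rebind to dax_hmem is the hypothetical
part):

	# offline a memory block the kernel currently owns...
	echo offline > /sys/devices/system/memory/memory42/state
	# ...and hand its range back to device-dax, through some
	# to-be-defined new_id-style interface on the dax bus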

> Have a look at that branch just be warned it's not polished, or just
> wait for me to send them out with a proper cover letter, and I'll take
> a look at what you have below.
> 
Cool, awesome.


>>
>> The series is divided as follows:
>>
>>  * Patch 1 , 3: Preparatory work for patch 7 for adding support for
>>                vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL
>>
>>  * Patch 2 , 4: Preparatory work for patch 7 for adding support for
>>                follow_pfn() to work with 2M/1G huge pages, which is
>>                what KVM uses for VM_PFNMAP.
>>
>>  * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
>>                which encompasses mainly dealing with the lack of devmap,
>>                and creating a VM_PFNMAP vma.
>>
>>  * Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
>>            At the very end of the cover letter (after scissors mark),
>>            there's a patch for ndctl to be able to create namespaces
>>            with '--mode devdax --map none'.
>>
>>  * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
>>             a PFN.
>>
>>  * Patch 10: The actual end consumer example for RAM case. The patch just adds a
>>              label storage area which consequently allows namespaces to be
>>              created. We picked PMEM legacy for starters.
>>
>> Thoughts, coments appreciated.
>>         Joao
>>
>> P.S. As an example to try this out:
>>
>>  1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
>>  and kernel has 16G.
>>
>>  2) create a devdax namespace with 1G hugepages:
>>
>>  $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
>>  {
>>   "dev":"namespace0.0",
>>   "mode":"devdax",
>>   "map":"none",
>>   "size":"32.00 GiB (34.36 GB)",
>>   "uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
>>   "daxregion":{
>>     "id":0,
>>     "size":"32.00 GiB (34.36 GB)",
>>     "align":1073741824,
>>     "devices":[
>>       {
>>         "chardev":"dax0.0",
>>         "size":"32.00 GiB (34.36 GB)",
>>         "target_node":0,
>>         "mode":"devdax"
>>       }
>>     ]
>>   },
>>   "align":1073741824
>>  }
>>
>>  3) Add this to your qemu params:
>>   -m 32G
>>   -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
>>   -numa node,memdev=mem
>>
>> TODO:
>>
>>  * Discontiguous regions/namespaces: The work above is limited to max
>> contiguous extent, coming from nvdimm dpa allocation heuristics -- which I take
>> is because of what specs allow for persistent namespaces. But for volatile RAM
>> case we would need handling of discontiguous extents (hence a region would represent
>> more than a resource) to be less bound to how guests are placed on the system.
>> I played around with multi-resource for device-dax, but I'm wondering about
>> UABI: 1) whether nvdimm DPA allocation heuristics should be relaxed for RAM
>> case (under certain nvdimm region bits); or if 2) device-dax would have its
>> own separate UABI to be used by daxctl (which would be also useful for hmem
>> devices?).
>>

Btw, one thing I couldn't tell from the specs is whether a persistent
namespace is allowed to be represented with 2 or more discontiguous
resources. I know of block-mode namespaces, but I am not sure it is
legitimate for a 'discontiguous' persistent namespace to be represented with
block-mode namespace labels despite not having any btt interface. That
seemed to work so far (but I am not testing on an actual NVDIMM); and given
that we would pursue the hmem approach, this is also less relevant.

>>  * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
>> device-dax, which would allow collect_procs() to find the processes solely based
>> on the PFN. But for discontiguous namespaces, not sure if this would work; perhaps
>> looking at the dax-region pfn range for each DAX vma.
> 
> You mean, make the memory error handling device-dax aware?
> 

/nods

The reason being that with discontiguous regions the assumption of 'vma PFN
= vm_pgoff + ((addr - vm_start) >> PAGE_SHIFT)' would no longer be sufficient
to determine whether a vma with VM_PFNMAP maps the PFN that triggered the
MCE. So far I have a vma vm operation which finds the address for a given
PFN (essentially the reverse of follow_pfn()). Given that devdax knows which
pfns it fills in, it would be cheap. Memory failure then SIGBUSes the found
process, whose VMM injects a vMCE into the guest; the guest then recovers
the way it sees fit. This is not in this series but I was hoping to include
it in v2.
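
A rough sketch of that vm operation (names and the single-resource lookup
are assumptions; a discontiguous region would walk its extents instead):

	/*
	 * Hypothetical reverse of follow_pfn() for a device-dax VMA, so that
	 * memory_failure()'s collect_procs() can locate the mapping of @pfn.
	 */
	static unsigned long dev_dax_pfn_to_addr(struct vm_area_struct *vma,
						 unsigned long pfn)
	{
		struct dev_dax *dev_dax = vma->vm_file->private_data;
		struct resource *res = &dev_dax->region->res;
		unsigned long addr, pgoff;

		if (pfn < PHYS_PFN(res->start) || pfn > PHYS_PFN(res->end))
			return 0;

		pgoff = pfn - PHYS_PFN(res->start);
		if (pgoff < vma->vm_pgoff)
			return 0;

		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		return addr < vma->vm_end ? addr : 0;
	}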

>>
>>  * NUMA: For now excluded setting the target_node; while these two patches
>>  are being worked on[1][2].
>>
>>  [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
>>  [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> I'll ping x86 folks again after the merge window. I expect they have
> just not had time to ack them yet.
> 
Cool, Thanks for the feedback so far!

	Joao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-02-04 18:20       ` Barret Rhoden
@ 2020-02-04 19:24         ` Joao Martins
  2020-02-04 21:43         ` Dan Williams
  1 sibling, 0 replies; 26+ messages in thread
From: Joao Martins @ 2020-02-04 19:24 UTC (permalink / raw)
  To: Barret Rhoden, Dan Williams
  Cc: linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Boris Ostrovsky, Matthew Wilcox,
	Konrad Rzeszutek Wilk

On 2/4/20 6:20 PM, Barret Rhoden wrote:
> Hi -
> 
> On 2/4/20 11:44 AM, Dan Williams wrote:
>> On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <brho@google.com> wrote:
>>>
>>> Hi -
>>>
>>> On 1/10/20 2:03 PM, Joao Martins wrote:
>>>> User can define regions with 'memmap=size!offset' which in turn
>>>> creates PMEM legacy devices. But because it is a label-less
>>>> NVDIMM device we only have one namespace for the whole device.
>>>>
>>>> Add support for multiple namespaces by adding ndctl control
>>>> support, and exposing a minimal set of features:
>>>> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
>>>> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
>>>> store labels.
>>>
>>> FWIW, I like this a lot.  If we move away from using memmap in favor of
>>> efi_fake_mem, ideally we'd have the same support for full-fledged
>>> pmem/dax regions and namespaces that this patch brings.
>>
>> No, efi_fake_mem only supports creating dax-regions. What's the use
>> case that can't be satisfied by just specifying multiple memmap=
>> ranges?
> 
> I'd like to be able to create and destroy dax regions on the fly.  In 
> particular, I want to run guest VMs using the dax files for guest 
> memory, but I don't know at boot time how many VMs I'll have, or what 
> their sizes are.  Ideally, I'd have separate files for each VM, instead 
> of a single /dev/dax.
> 
> I currently do this with fs-dax with one big memmap region (ext4 on 
> /dev/pmem0), and I use the file system to handle the 
> creation/destruction/resizing and metadata management.  But since fs-dax 
> won't work with device pass-through, I started looking at dev-dax, with 
> the expectation that I'd need some software to manage the memory (i.e. 
> allocation).  That led me to ndctl, which seems to need namespace labels 
> to have the level of control I was looking for.
> 

Indeed this is the intent of the patch.

As Barret mentioned, memmap= is limited to one namespace covering the entire
region, and this would fix that (regardless of namespace mode). Otherwise we
have to know in advance the number of guests and their exact sizes, which
would be somewhat inflexible.

But given that it's 'pmem emulation' I thought it was OK to twist the label-less
aspect of nd_e820 (unless there's hardware out there which does this?).

If Dan agrees, I can continue with the patch.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-02-04 18:20       ` Barret Rhoden
  2020-02-04 19:24         ` Joao Martins
@ 2020-02-04 21:43         ` Dan Williams
  2020-02-04 21:57           ` Barret Rhoden
  1 sibling, 1 reply; 26+ messages in thread
From: Dan Williams @ 2020-02-04 21:43 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: Joao Martins, linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Boris Ostrovsky, Matthew Wilcox,
	Konrad Rzeszutek Wilk

On Tue, Feb 4, 2020 at 10:20 AM Barret Rhoden <brho@google.com> wrote:
>
> Hi -
>
> On 2/4/20 11:44 AM, Dan Williams wrote:
> > On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <brho@google.com> wrote:
> >>
> >> Hi -
> >>
> >> On 1/10/20 2:03 PM, Joao Martins wrote:
> >>> User can define regions with 'memmap=size!offset' which in turn
> >>> creates PMEM legacy devices. But because it is a label-less
> >>> NVDIMM device we only have one namespace for the whole device.
> >>>
> >>> Add support for multiple namespaces by adding ndctl control
> >>> support, and exposing a minimal set of features:
> >>> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
> >>> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
> >>> store labels.
> >>
> >> FWIW, I like this a lot.  If we move away from using memmap in favor of
> >> efi_fake_mem, ideally we'd have the same support for full-fledged
> >> pmem/dax regions and namespaces that this patch brings.
> >
> > No, efi_fake_mem only supports creating dax-regions. What's the use
> > case that can't be satisfied by just specifying multiple memmap=
> > ranges?
>
> I'd like to be able to create and destroy dax regions on the fly.  In
> particular, I want to run guest VMs using the dax files for guest
> memory, but I don't know at boot time how many VMs I'll have, or what
> their sizes are.  Ideally, I'd have separate files for each VM, instead
> of a single /dev/dax.
>
> I currently do this with fs-dax with one big memmap region (ext4 on
> /dev/pmem0), and I use the file system to handle the
> creation/destruction/resizing and metadata management.  But since fs-dax
> won't work with device pass-through, I started looking at dev-dax, with
> the expectation that I'd need some software to manage the memory (i.e.
> allocation).  That led me to ndctl, which seems to need namespace labels
> to have the level of control I was looking for.

Ah, got it, you only ended up wanting namespace labels because
there was no other way to carve up device-dax. That's changing as part
of the efi_fake_mem= enabling, and I have a patch set in the works to
allow discontiguous sub-divisions of a device-dax range. Note that
this branch rebases frequently:

https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending


>
> Thanks,
>
> Barret
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-02-04 21:43         ` Dan Williams
@ 2020-02-04 21:57           ` Barret Rhoden
  0 siblings, 0 replies; 26+ messages in thread
From: Barret Rhoden @ 2020-02-04 21:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, linux-nvdimm, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, KVM list, Andrew Morton,
	Linux MM, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, X86 ML,
	Liran Alon, Nikita Leshenko, Boris Ostrovsky, Matthew Wilcox,
	Konrad Rzeszutek Wilk

On 2/4/20 4:43 PM, Dan Williams wrote:
> Ah, got it, you only ended up at wanting namespace labels because
> there was no other way to carve up device-dax. That's changing as part
> of the efi_fake_mem= enabling and I have a patch set in the works to
> allow discontiguous sub-divisions of a device-dax range. Note that is
> this branch rebases frequently:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

Cool, thanks.  I'll check it out!

Barret

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs
  2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs Joao Martins
@ 2020-02-07 21:08   ` Jason Gunthorpe
  2020-02-11 16:23     ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2020-02-07 21:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

On Fri, Jan 10, 2020 at 07:03:12PM +0000, Joao Martins wrote:
> From: Nikita Leshenko <nikita.leshchenko@oracle.com>
> 
> Unconditionally interpreting vm_pgoff as a PFN is incorrect.
> 
> VMAs created by /dev/mem do this, but in general VM_PFNMAP just means
> that the VMA doesn't have an associated struct page and is being managed
> directly by something other than the core mmu.
> 
> Use follow_pfn like KVM does to find the PFN.
> 
> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>  drivers/vfio/vfio_iommu_type1.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ada8e6cdb88..1e43581f95ea 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -362,9 +362,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> -		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> -		if (is_invalid_reserved_pfn(*pfn))
> -			ret = 0;
> +		ret = follow_pfn(vma, vaddr, pfn);
> +		if (!ret && !is_invalid_reserved_pfn(*pfn))
> +			ret = -EOPNOTSUPP;
>  	}

FWIW this existing code is a huge hack and a security problem.

I'm not sure how you could be successfully using this path on actual
memory without hitting bad bugs?

Fundamentally VFIO can't retain a reference to a page from within a VMA
without some kind of refcount/locking/etc to allow the thing that put
the page there to know it is still being used (ie programmed in a
IOMMU) by VFIO.

Otherwise it creates use-after-free style security problems on the
page.

This code needs to be deleted, not extended :(

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs
  2020-02-07 21:08   ` Jason Gunthorpe
@ 2020-02-11 16:23     ` Joao Martins
  2020-02-11 16:50       ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2020-02-11 16:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

On 2/7/20 9:08 PM, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2020 at 07:03:12PM +0000, Joao Martins wrote:
>> From: Nikita Leshenko <nikita.leshchenko@oracle.com>
>>
>> Unconditionally interpreting vm_pgoff as a PFN is incorrect.
>>
>> VMAs created by /dev/mem do this, but in general VM_PFNMAP just means
>> that the VMA doesn't have an associated struct page and is being managed
>> directly by something other than the core mmu.
>>
>> Use follow_pfn like KVM does to find the PFN.
>>
>> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>>  drivers/vfio/vfio_iommu_type1.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ada8e6cdb88..1e43581f95ea 100644
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -362,9 +362,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
>>  
>>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>> -		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> -		if (is_invalid_reserved_pfn(*pfn))
>> -			ret = 0;
>> +		ret = follow_pfn(vma, vaddr, pfn);
>> +		if (!ret && !is_invalid_reserved_pfn(*pfn))
>> +			ret = -EOPNOTSUPP;
>>  	}
> 
> FWIW this existing code is a huge hack and a security problem.
> 
> I'm not sure how you could be successfully using this path on actual
> memory without hitting bad bugs?
> 
ATM I think this codepath is largely hit for MMIO (GPU passthrough, or
mdev). In the context of this patch, guest memory would be treated
similarly, meaning the device-dax backing memory wouldn't have a 'struct
page' (as introduced in this series).

> Fudamentally VFIO can't retain a reference to a page from within a VMA
> without some kind of recount/locking/etc to allow the thing that put
> the page there to know it is still being used (ie programmed in a
> IOMMU) by VFIO.
> 
> Otherwise it creates use-after-free style security problems on the
> page.
> 
I take it you're referring to the past problems with long-term page pinning
+ fsdax? Or did you have something else in mind, perhaps related to your
LSFMM topic?

Here the memory can't be used by the kernel (and there's no struct page)
except by device-dax managing/tearing down/driving the pfn region (which is
static, and the underlying PFNs won't change throughout the device's
lifetime), and by vfio pinning/unpinning the pfns (which are refcounted
across multiple map/unmaps);

> This code needs to be deleted, not extended :(

To some extent it isn't really an extension: the patch was just removing
the assumption of @vm_pgoff being the 'start pfn' on PFNMAP vmas. Something
similar is also done by get_vaddr_frames().

	Joao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs
  2020-02-11 16:23     ` Joao Martins
@ 2020-02-11 16:50       ` Jason Gunthorpe
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2020-02-11 16:50 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-nvdimm, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

On Tue, Feb 11, 2020 at 04:23:49PM +0000, Joao Martins wrote:
> On 2/7/20 9:08 PM, Jason Gunthorpe wrote:
> > On Fri, Jan 10, 2020 at 07:03:12PM +0000, Joao Martins wrote:
> >> From: Nikita Leshenko <nikita.leshchenko@oracle.com>
> >>
> >> Unconditionally interpreting vm_pgoff as a PFN is incorrect.
> >>
> >> VMAs created by /dev/mem do this, but in general VM_PFNMAP just means
> >> that the VMA doesn't have an associated struct page and is being managed
> >> directly by something other than the core mmu.
> >>
> >> Use follow_pfn like KVM does to find the PFN.
> >>
> >> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> >>  drivers/vfio/vfio_iommu_type1.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 2ada8e6cdb88..1e43581f95ea 100644
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -362,9 +362,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >>  	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
> >>  
> >>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> >> -		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> >> -		if (is_invalid_reserved_pfn(*pfn))
> >> -			ret = 0;
> >> +		ret = follow_pfn(vma, vaddr, pfn);
> >> +		if (!ret && !is_invalid_reserved_pfn(*pfn))
> >> +			ret = -EOPNOTSUPP;
> >>  	}
> > 
> > FWIW this existing code is a huge hack and a security problem.
> > 
> > I'm not sure how you could be successfully using this path on actual
> > memory without hitting bad bugs?
> > 
> ATM I think this codepath is largelly hit at the moment for MMIO (GPU
> passthrough, or mdev). In the context of this patch, guest memory would be
> treated similarly meaning the device-dax backing memory wouldn't have a 'struct
> page' (as introduced in this series).

I think it is being used specifically to allow two VFIO devices to be
inserted into a VM and have the IOMMU set up to allow MMIO access.

> > Fudamentally VFIO can't retain a reference to a page from within a VMA
> > without some kind of recount/locking/etc to allow the thing that put
> > the page there to know it is still being used (ie programmed in a
> > IOMMU) by VFIO.
> > 
> > Otherwise it creates use-after-free style security problems on the
> > page.
>
> I take it you're referring to the past problems with long term page pinning +
> fsdax? Or you had something else in mind, perhaps related to your LSFMM topic?

No. I'm referring to retaining access to memory backing a VMA without
holding any kind of lock on it. This is an access-after-free scenario.

It *should* be like a long term page pin so that the VMA owner knows
something is happening.
 
> Here the memory can't be used by the kernel (and there's no struct page) except
> from device-dax managing/tearing/driving the pfn region (which is static and the
> underlying PFNs won't change throughout device lifetime), and vfio
> pinning/unpinning the pfns (which are refcounted against multiple map/unmaps);

For instance if you tear down the device-dax then VFIO will happily
continue to reference the memory. This is a bug.

There are other cases that escalate to security bugs.

> > This code needs to be deleted, not extended :(
> 
> To some extent it isn't really an extension: the patch was just removing the
> assumption @vm_pgoff being the 'start pfn' on PFNMAP vmas. This is also
> similarly done by get_vaddr_frames().

You are extending it in the sense that you plan to use it for more
cases than VMAs created by some other VFIO. That should not be
done as it will only complicate fixing this code.

KVM is allowed to use follow_pfn because it uses MMU notifiers and
does not allow the result of follow_pfn to outlive the VMA (AFAIK at
least). So it should be safe.
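
For reference, the shape of that contract, simplified from
virt/kvm/kvm_main.c (other hooks omitted):

	static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
		/* fires before a mapping is changed or torn down */
		.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
		.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
		.release		= kvm_mmu_notifier_release,
	};

When these fire, KVM drops or re-derives anything it obtained via
follow_pfn(); VFIO has no equivalent today.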

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-02-11 16:50 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
2020-02-03 21:34   ` Matthew Wilcox
2020-02-04 16:14     ` Joao Martins
2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
2020-02-03 21:37   ` Matthew Wilcox
2020-02-04 16:17     ` Joao Martins
2020-01-10 19:03 ` [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL Joao Martins
2020-01-10 19:03 ` [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn() Joao Martins
2020-01-10 19:03 ` [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap() Joao Martins
2020-01-10 19:03 ` [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper Joao Martins
2020-01-10 19:03 ` [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags Joao Martins
2020-01-10 19:03 ` [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE Joao Martins
2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs Joao Martins
2020-02-07 21:08   ` Jason Gunthorpe
2020-02-11 16:23     ` Joao Martins
2020-02-11 16:50       ` Jason Gunthorpe
2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins
2020-02-04 15:28   ` Barret Rhoden
2020-02-04 16:44     ` Dan Williams
2020-02-04 18:20       ` Barret Rhoden
2020-02-04 19:24         ` Joao Martins
2020-02-04 21:43         ` Dan Williams
2020-02-04 21:57           ` Barret Rhoden
2020-02-04  1:24 ` [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Dan Williams
2020-02-04 19:07   ` Joao Martins
