KVM Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC 00/10] device-dax: Support devices without PFN metadata
@ 2020-01-10 19:03 Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Hey,

Presented herewith a small series which allows device-dax to work without
struct page to be used to back KVM guests memory. It's an RFC, and there's
still some items we're looking at (see TODO below); but wondering if folks
would be OK carving some time out of their busy schedules to provide feedback
direction-wise on this work.

In virtualized environments (specially those with no kernel-backed PV
interfaces, and just SR-IOV), memory is largelly assigned to guests: either
persistent with NVDIMMs or volatile for regular RAM. The kernel
(hypervisor) tracks it with 'struct page' (64b) for each 4K page. Overall
we're spending 16GB for each 1Tb of host memory tracked that the kernel won't
need  which could instead be used to create other guests. One of motivations of
this series is to then get that memory used for 'struct page', when it is meant
to solely be used by userspace. This is also useful for the case of memory
backing guests virtual NVDIMMs. The other neat side effect is that the
hypervisor has no virtual mapping of the guest and hence code gadgets (if
found) are limited in their effectiveness.

It is expected that a smaller (instead of total) amount of host memory is
defined for the kernel (with mem=X or memmap=X!Y). For KVM userspace VMM (e.g.
QEMU), the main thing that is needed is a device which open + mmap + close with
a certain alignment (4K, 2M, 1G). That made us look at device-dax which does
just that and so the work comprised here was improving what's there and the
interfaces it uses.

The series is divided as follows:

 * Patch 1 , 3: Preparatory work for patch 7 for adding support for
	       vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL
	
 * Patch 2 , 4: Preparatory work for patch 7 for adding support for
	       follow_pfn() to work with 2M/1G huge pages, which is
	       what KVM uses for VM_PFNMAP.

 * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
	       which encompasses mainly dealing with the lack of devmap,
	       and creating a VM_PFNMAP vma.

 * Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
	   At the very end of the cover letter (after scissors mark),
	   there's a patch for ndctl to be able to create namespaces
	   with '--mode devdax --map none'.

 * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
 	    a PFN.

 * Patch 10: The actual end consumer example for RAM case. The patch just adds a
	     label storage area which consequently allows namespaces to be
	     created. We picked PMEM legacy for starters.

Thoughts, coments appreciated.

	Joao

P.S. As an example to try this out:

 1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
 and kernel has 16G.

 2) create a devdax namespace with 1G hugepages: 

 $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
 {
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"none",
  "size":"32.00 GiB (34.36 GB)",
  "uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
  "daxregion":{
    "id":0,
    "size":"32.00 GiB (34.36 GB)",
    "align":1073741824,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"32.00 GiB (34.36 GB)",
        "target_node":0,
        "mode":"devdax"
      }
    ]
  },
  "align":1073741824
 }

 3) Add this to your qemu params:
  -m 32G 
  -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
  -numa node,memdev=mem

TODO:

 * Discontiguous regions/namespaces: The work above is limited to max
contiguous extent, coming from nvdimm dpa allocation heuristics -- which I take
is because of what specs allow for persistent namespaces. But for volatile RAM
case we would need handling of discontiguous extents (hence a region would represent
more than a resource) to be less bound to how guests are placed on the system.
I played around with multi-resource for device-dax, but I'm wondering about
UABI: 1) whether nvdimm DPA allocation heuristics should be relaxed for RAM
case (under certain nvdimm region bits); or if 2) device-dax would have it's
own separate UABI to be used by daxctl (which would be also useful for hmem
devices?).

 * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
device-dax, which would allow collect_procs() to find the processes solely based
on the PFN. But for discontiguous namespaces, not sure if this would work; perhaps
looking at the dax-region pfn range for each DAX vma.

 * NUMA: For now excluded setting the target_node; while these two patches
 are being worked on[1][2].

 [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
 [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/


Joao Martins (9):
  mm: Add pmd support for _PAGE_SPECIAL
  mm: Handle pmd entries in follow_pfn()
  mm: Add pud support for _PAGE_SPECIAL
  mm: Handle pud entries in follow_pfn()
  device-dax: Do not enforce MADV_DONTFORK on mmap()
  device-dax: Introduce pfn_flags helper
  device-dax: Add support for PFN_SPECIAL flags
  dax/pmem: Add device-dax support for PFN_MODE_NONE
  nvdimm/e820: add multiple namespaces support

Nikita Leshenko (1):
  vfio/type1: Use follow_pfn for VM_FPNMAP VMAs

 arch/x86/include/asm/pgtable.h  |  34 ++++-
 drivers/dax/bus.c               |   3 +-
 drivers/dax/device.c            |  78 ++++++++----
 drivers/dax/pmem/core.c         |  36 +++++-
 drivers/nvdimm/e820.c           | 212 ++++++++++++++++++++++++++++----
 drivers/vfio/vfio_iommu_type1.c |   6 +-
 mm/gup.c                        |   6 +
 mm/huge_memory.c                |  15 ++-
 mm/memory.c                     |  67 ++++++++--
 9 files changed, 382 insertions(+), 75 deletions(-)

8>----------------
Subject: [PATCH] ndctl: add 'devdax' support for NDCTL_PFN_LOC_NONE

diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 7fb00078646b..2568943eb207 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -206,6 +206,8 @@ static int set_defaults(enum device_action mode)
 			/* pass */;
 		else if (strcmp(param.map, "dev") == 0)
 			/* pass */;
+		else if (strcmp(param.map, "none") == 0)
+			/* pass */;
 		else {
 			error("invalid map location '%s'\n", param.map);
 			rc = -EINVAL;
@@ -755,9 +757,17 @@ static int validate_namespace_options(struct ndctl_region *region,
 	if (param.map) {
 		if (!strcmp(param.map, "mem"))
 			p->loc = NDCTL_PFN_LOC_RAM;
+		else if (!strcmp(param.map, "none"))
+			p->loc = NDCTL_PFN_LOC_NONE;
 		else
 			p->loc = NDCTL_PFN_LOC_PMEM;
 
+		if (p->loc == NDCTL_PFN_LOC_NONE
+			&& p->mode != NDCTL_NS_MODE_DAX) {
+			debug("--map=none only valid for devdax mode namespace\n");
+			return -EINVAL;
+		}
+
 		if (ndns && p->mode != NDCTL_NS_MODE_MEMORY
 			&& p->mode != NDCTL_NS_MODE_DAX) {
 			debug("%s: --map= only valid for fsdax mode namespace\n",


-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Currently vmf_insert_pfn_pmd only works with devmap
and BUG_ON otherwise. Add support for handling
page special when pfn_t has it marked with PFN_SPECIAL.

Usage of page special aren't expected to do GUP
hence return no pages on gup_huge_pmd() much like how
it is done for ptes on gup_pte_range().

This allows a DAX driver to handle 2M hugepages without
struct pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 arch/x86/include/asm/pgtable.h | 16 +++++++++++++++-
 mm/gup.c                       |  3 +++
 mm/huge_memory.c               |  7 ++++---
 mm/memory.c                    |  3 ++-
 4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ad97dc155195..60351c0c15fe 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -255,7 +255,7 @@ static inline int pmd_large(pmd_t pte)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_trans_huge(pmd_t pmd)
 {
-	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
 {
 	return 0;
 }
+#endif
+
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline int pmd_special(pmd_t pmd)
+{
+	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
+}
+#endif
+
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -414,6 +423,11 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DEVMAP);
 }
 
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..ba5f10535392 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
 	}
 
+	if (pmd_special(orig))
+		return 0;
+
 	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..06ad4d6f7477 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pmd_mkdevmap(entry);
+	else if (pfn_t_special(pfn))
+		entry = pmd_mkspecial(entry);
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -2013,7 +2014,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+			pmd_devmap(*pmd) || pmd_special(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9a4f52..cfc3668bddeb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1165,7 +1165,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+		    pmd_devmap(*pmd) || pmd_special(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn()
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL Joao Martins
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

When follow_pfn hits a pmd_huge() it won't return a valid PFN
given it's usage of follow_pte(). Fix that up to pass a @pmdpp
and thus allow callers to get the pmd pointer. If we encounter
such a huge page, we calculate the pfn offset to the PMD
accordingly.

This allows KVM to handle 2M hugepage pfns on VM_PFNMAP vmas.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index cfc3668bddeb..db99684d2cb3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4366,6 +4366,7 @@ EXPORT_SYMBOL(follow_pte_pmd);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn)
 {
+	pmd_t *pmdpp = NULL;
 	int ret = -EINVAL;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -4373,10 +4374,14 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
+	ret = follow_pte_pmd(vma->vm_mm, address, NULL,
+			     &ptep, &pmdpp, &ptl);
 	if (ret)
 		return ret;
-	*pfn = pte_pfn(*ptep);
+	if (pmdpp)
+		*pfn = pmd_pfn(*pmdpp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+	else
+		*pfn = pte_pfn(*ptep);
 	pte_unmap_unlock(ptep, ptl);
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn() Joao Martins
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Currently vmf_insert_pfn_pud only works with devmap
and BUG_ON otherwise. Add support for handling
page special when pfn_t has it marked with PFN_SPECIAL.

Usage of this type of pages aren't expected to do GUP
hence return no pages on gup_huge_pud() much like how
it is done for ptes on gup_pte_range() and for pmds on
gup_huge_pmd().

This allows device-dax to handle 1G hugepages without
struct pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 arch/x86/include/asm/pgtable.h | 18 +++++++++++++++++-
 mm/gup.c                       |  3 +++
 mm/huge_memory.c               |  8 +++++---
 mm/memory.c                    |  3 ++-
 4 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 60351c0c15fe..2027c063fa16 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -261,7 +261,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
 }
 #endif
 
@@ -300,6 +300,17 @@ static inline int pmd_special(pmd_t pmd)
 {
 	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
 }
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_special(pud_t pud)
+{
+	return !!(pud_flags(pud) & _PAGE_SPECIAL);
+}
+#else
+static inline int pud_special(pud_t pud)
+{
+	return 0;
+}
 #endif
 
 #endif
@@ -487,6 +498,11 @@ static inline pud_t pud_mkhuge(pud_t pud)
 	return pud_set_flags(pud, _PAGE_PSE);
 }
 
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+	return pud_set_flags(pud, _PAGE_SPECIAL);
+}
+
 static inline pud_t pud_mkyoung(pud_t pud)
 {
 	return pud_set_flags(pud, _PAGE_ACCESSED);
diff --git a/mm/gup.c b/mm/gup.c
index ba5f10535392..ae4abe5878ad 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2123,6 +2123,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
 	}
 
+	if (pud_special(orig))
+		return 0;
+
 	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 06ad4d6f7477..cff707163bc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -879,6 +879,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pud_mkdevmap(entry);
+	else if (pfn_t_special(pfn))
+		entry = pud_mkspecial(entry);
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
@@ -901,8 +903,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -2031,7 +2032,8 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 
 	ptl = pud_lock(vma->vm_mm, pud);
-	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)) ||
+		   pud_special(*pud))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index db99684d2cb3..109643219e1b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1201,7 +1201,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+		if (pud_trans_huge(*pud) || pud_devmap(*pud) ||
+		    pud_special(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
 				split_huge_pud(vma, pud, addr);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn()
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (2 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap() Joao Martins
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

When follow_pfn hits a pud_huge() it won't return a valid PFN
for a PUD. Fix it by adding @pudp and thus allow callers to
get the pud pointer. If we encounter such a huge page, we
calculate the offset to the PUD accordingly.

This allows KVM to handle 1G hugepage pfns on VM_PFNMAP vmas.

Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 109643219e1b..f46646630497 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4261,9 +4261,10 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+static int __follow_pte_pud(struct mm_struct *mm, unsigned long address,
 			    struct mmu_notifier_range *range,
-			    pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+			    pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp,
+			    spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4280,6 +4281,28 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		goto out;
 
 	pud = pud_offset(p4d, address);
+	VM_BUG_ON(pud_trans_huge(*pud));
+
+	if (pud_huge(*pud)) {
+		if (!pudpp)
+			goto out;
+
+		if (range) {
+			mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0,
+						NULL, mm, address & PUD_MASK,
+						(address & PUD_MASK) + PUD_SIZE);
+			mmu_notifier_invalidate_range_start(range);
+		}
+		*ptlp = pud_lock(mm, pud);
+		if (pud_huge(*pud)) {
+			*pudpp = pud;
+			return 0;
+		}
+		spin_unlock(*ptlp);
+		if (range)
+			mmu_notifier_invalidate_range_end(range);
+	}
+
 	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
 		goto out;
 
@@ -4335,8 +4358,8 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, NULL,
-						    ptepp, NULL, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, NULL,
+						    ptepp, NULL, NULL, ptlp)));
 	return res;
 }
 
@@ -4348,12 +4371,26 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, range,
-						    ptepp, pmdpp, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, NULL, ptlp)));
 	return res;
 }
 EXPORT_SYMBOL(follow_pte_pmd);
 
+static int follow_pte_pud(struct mm_struct *mm, unsigned long address,
+			  struct mmu_notifier_range *range,
+			  pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp,
+			  spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, pudpp, ptlp)));
+	return res;
+}
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4368,6 +4405,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn)
 {
 	pmd_t *pmdpp = NULL;
+	pud_t *pudpp = NULL;
 	int ret = -EINVAL;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -4375,11 +4413,13 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte_pmd(vma->vm_mm, address, NULL,
-			     &ptep, &pmdpp, &ptl);
+	ret = follow_pte_pud(vma->vm_mm, address, NULL,
+			     &ptep, &pmdpp, &pudpp, &ptl);
 	if (ret)
 		return ret;
-	if (pmdpp)
+	if (pudpp)
+		*pfn = pud_pfn(*pudpp) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+	else if (pmdpp)
 		*pfn = pmd_pfn(*pmdpp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 	else
 		*pfn = pte_pfn(*ptep);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap()
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (3 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn() Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper Joao Martins
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Currently check_vma() checks for VM_DONTCOPY for a device-dax
when the dax region is not backed by devmap (i.e. PFN_MAP is not set).
VM_DONTCOPY is set through madvise(MADV_DONTFORK) and it only sets it
at an address returned from mmap(). check_vma() is called at devdax
mmap hence checking VM_DONTCOPY prevents a process from mmap-ing the
device.

Let's not enforce MADV_DONTFORK at mmap(), but rather when it actually
gets used (on fault). The assumptions don't change, as it is
expected to still retain/madvise MADV_DONTFORK after mmap.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 1af823b2fe6b..c6a7f5e12c54 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -14,7 +14,7 @@
 #include "dax-private.h"
 #include "bus.h"
 
-static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+static int check_vma_mmap(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		const char *func)
 {
 	struct dax_region *dax_region = dev_dax->region;
@@ -41,17 +41,29 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		return -EINVAL;
 	}
 
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
-			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+	if (!vma_is_dax(vma)) {
 		dev_info_ratelimited(dev,
-				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
+				"%s: %s: fail, vma is not DAX capable\n",
 				current->comm, func);
 		return -EINVAL;
 	}
 
-	if (!vma_is_dax(vma)) {
-		dev_info_ratelimited(dev,
-				"%s: %s: fail, vma is not DAX capable\n",
+	return 0;
+}
+
+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+		     const char *func)
+{
+	int rc;
+
+	rc = check_vma_mmap(dev_dax, vma, func);
+	if (rc < 0)
+		return rc;
+
+	if ((dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
+			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+		dev_info_ratelimited(&dev_dax->dev,
+				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
 				current->comm, func);
 		return -EINVAL;
 	}
@@ -315,7 +327,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * fault time.
 	 */
 	id = dax_read_lock();
-	rc = check_vma(dev_dax, vma, __func__);
+	rc = check_vma_mmap(dev_dax, vma, __func__);
 	dax_read_unlock(id);
 	if (rc)
 		return rc;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (4 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap() Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags Joao Martins
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Replace PFN_DEV|PFN_MAP check call sites with two helper functions
dax_is_pfn_dev() and dax_is_pfn_map().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index c6a7f5e12c54..113a554de3ee 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -14,6 +14,17 @@
 #include "dax-private.h"
 #include "bus.h"
 
+static int dax_is_pfn_dev(struct dev_dax *dev_dax)
+{
+	return (dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV;
+}
+
+static int dax_is_pfn_map(struct dev_dax *dev_dax)
+{
+	return (dev_dax->region->pfn_flags &
+		(PFN_DEV|PFN_MAP)) == (PFN_DEV|PFN_MAP);
+}
+
 static int check_vma_mmap(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		const char *func)
 {
@@ -60,8 +71,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	if (rc < 0)
 		return rc;
 
-	if ((dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
-			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
+	if (dax_is_pfn_dev(dev_dax) && (vma->vm_flags & VM_DONTCOPY) == 0) {
 		dev_info_ratelimited(&dev_dax->dev,
 				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
 				current->comm, func);
@@ -140,7 +150,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 	}
 
 	/* dax pmd mappings require pfn_t_devmap() */
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
+	if (!dax_is_pfn_map(dev_dax)) {
 		dev_dbg(dev, "region lacks devmap flags\n");
 		return VM_FAULT_SIGBUS;
 	}
@@ -190,7 +200,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 	}
 
 	/* dax pud mappings require pfn_t_devmap() */
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
+	if (!dax_is_pfn_map(dev_dax)) {
 		dev_dbg(dev, "region lacks devmap flags\n");
 		return VM_FAULT_SIGBUS;
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (5 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE Joao Martins
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Right now we assume there's gonna be a PFN_DEV|PFN_MAP which
means it will have a struct page backing the PFN but that is
not placed in normal system RAM zones.

Add support for PFN_DEV|PFN_SPECIAL only and therefore the
underlying vma won't have a struct page. For device dax, this
means not assuming callers will pass a dev_pagemap, and avoid
returning SIGBUS for the lack of PFN_MAP region pfn flag and
finally not setting struct page index/mapping on fault.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/bus.c    |  3 ++-
 drivers/dax/device.c | 40 ++++++++++++++++++++++------------------
 2 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 46e46047a1f7..96ca3ac85278 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -414,7 +414,8 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
 
-	memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap));
+	if (pgmap)
+		memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap));
 
 	/*
 	 * No 'host' or dax_operations since there is no access to this
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 113a554de3ee..aa38f5ff180a 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -14,6 +14,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+static int dax_is_pfn_special(struct dev_dax *dev_dax)
+{
+	return (dev_dax->region->pfn_flags &
+		(PFN_DEV|PFN_SPECIAL)) == (PFN_DEV|PFN_SPECIAL);
+}
+
 static int dax_is_pfn_dev(struct dev_dax *dev_dax)
 {
 	return (dev_dax->region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV;
@@ -104,6 +110,7 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 	struct dax_region *dax_region;
 	phys_addr_t phys;
 	unsigned int fault_size = PAGE_SIZE;
+	int rc;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
 		return VM_FAULT_SIGBUS;
@@ -126,7 +133,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
+	if (dax_is_pfn_special(dev_dax))
+		rc = vmf_insert_pfn(vmf->vma, vmf->address, pfn_t_to_pfn(*pfn));
+	else
+		rc = vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
+
+	return rc;
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -149,12 +161,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!dax_is_pfn_map(dev_dax)) {
-		dev_dbg(dev, "region lacks devmap flags\n");
-		return VM_FAULT_SIGBUS;
-	}
-
 	if (fault_size < dax_region->align)
 		return VM_FAULT_SIGBUS;
 	else if (fault_size > dax_region->align)
@@ -199,12 +205,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	/* dax pud mappings require pfn_t_devmap() */
-	if (!dax_is_pfn_map(dev_dax)) {
-		dev_dbg(dev, "region lacks devmap flags\n");
-		return VM_FAULT_SIGBUS;
-	}
-
 	if (fault_size < dax_region->align)
 		return VM_FAULT_SIGBUS;
 	else if (fault_size > dax_region->align)
@@ -266,7 +266,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		rc = VM_FAULT_SIGBUS;
 	}
 
-	if (rc == VM_FAULT_NOPAGE) {
+	if (dax_is_pfn_map(dev_dax) && (rc == VM_FAULT_NOPAGE)) {
 		unsigned long i;
 		pgoff_t pgoff;
 
@@ -344,6 +344,8 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	vma->vm_ops = &dax_vm_ops;
 	vma->vm_flags |= VM_HUGEPAGE;
+	if (dax_is_pfn_special(dev_dax))
+		vma->vm_flags |= VM_PFNMAP;
 	return 0;
 }
 
@@ -450,10 +452,12 @@ int dev_dax_probe(struct device *dev)
 		return -EBUSY;
 	}
 
-	dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
-	addr = devm_memremap_pages(dev, &dev_dax->pgmap);
-	if (IS_ERR(addr))
-		return PTR_ERR(addr);
+	if (dax_is_pfn_map(dev_dax)) {
+		dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
+		addr = devm_memremap_pages(dev, &dev_dax->pgmap);
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
+	}
 
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (6 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

Allowing dax pmem driver to work without struct pages means
that user will request to not create any PFN metadata by writing
seed's device mode to PFN_MODE_NONE.

When the underlying nd_pfn->mode is PFN_MODE_NONE, most dax_pmem
initialization steps can be skipped because we won't have/need a
pfn superblock for the pagemap/struct-pages. We only allocate
an opaque zeroed object with the chosen align requested, and
finally add PFN_SPECIAL to the region pfn_flags.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/pmem/core.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index 2bedf8414fff..67f5604a8291 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -17,15 +17,38 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
 	struct nd_namespace_io *nsio;
 	struct dax_region *dax_region;
 	struct dev_pagemap pgmap = { };
+	struct dev_pagemap *devmap = NULL;
 	struct nd_namespace_common *ndns;
 	struct nd_dax *nd_dax = to_nd_dax(dev);
 	struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
 	struct nd_region *nd_region = to_nd_region(dev->parent);
+	unsigned long long pfn_flags = PFN_DEV;
 
 	ndns = nvdimm_namespace_common_probe(dev);
 	if (IS_ERR(ndns))
 		return ERR_CAST(ndns);
 
+	rc = sscanf(dev_name(&ndns->dev), "namespace%d.%d", &region_id, &id);
+	if (rc != 2)
+		return ERR_PTR(-EINVAL);
+
+	if (is_nd_dax(&nd_pfn->dev) && nd_pfn->mode == PFN_MODE_NONE) {
+		/* allocate a dummy super block */
+		pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb),
+				      GFP_KERNEL);
+		if (!pfn_sb)
+			return ERR_PTR(-ENOMEM);
+
+		memset(pfn_sb, 0, sizeof(*pfn_sb));
+		pfn_sb->align = nd_pfn->align;
+		nd_pfn->pfn_sb = pfn_sb;
+		pfn_flags |= PFN_SPECIAL;
+
+		nsio = to_nd_namespace_io(&ndns->dev);
+		memcpy(&res, &nsio->res, sizeof(res));
+		goto no_pfn_sb;
+	}
+
 	/* parse the 'pfn' info block via ->rw_bytes */
 	rc = devm_namespace_enable(dev, ndns, nd_info_block_reserve());
 	if (rc)
@@ -45,20 +68,21 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
 		return ERR_PTR(-EBUSY);
 	}
 
-	rc = sscanf(dev_name(&ndns->dev), "namespace%d.%d", &region_id, &id);
-	if (rc != 2)
-		return ERR_PTR(-EINVAL);
-
 	/* adjust the dax_region resource to the start of data */
 	memcpy(&res, &pgmap.res, sizeof(res));
 	res.start += offset;
+	devmap = &pgmap;
+	pfn_flags |= PFN_MAP;
+
+no_pfn_sb:
 	dax_region = alloc_dax_region(dev, region_id, &res,
 			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			PFN_DEV|PFN_MAP);
+			pfn_flags);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
-	dev_dax = __devm_create_dev_dax(dax_region, id, &pgmap, subsys);
+
+	dev_dax = __devm_create_dev_dax(dax_region, id, devmap, subsys);
 
 	/* child dev_dax instances now own the lifetime of the dax_region */
 	dax_region_put(dax_region);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (7 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

From: Nikita Leshenko <nikita.leshchenko@oracle.com>

Unconditionally interpreting vm_pgoff as a PFN is incorrect.

VMAs created by /dev/mem do this, but in general VM_PFNMAP just means
that the VMA doesn't have an associated struct page and is being managed
directly by something other than the core mmu.

Use follow_pfn like KVM does to find the PFN.

Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..1e43581f95ea 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -362,9 +362,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-		if (is_invalid_reserved_pfn(*pfn))
-			ret = 0;
+		ret = follow_pfn(vma, vaddr, pfn);
+		if (!ret && !is_invalid_reserved_pfn(*pfn))
+			ret = -EOPNOTSUPP;
 	}
 
 	up_read(&mm->mmap_sem);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support
  2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
                   ` (8 preceding siblings ...)
  2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs Joao Martins
@ 2020-01-10 19:03 ` Joao Martins
  9 siblings, 0 replies; 11+ messages in thread
From: Joao Martins @ 2020-01-10 19:03 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alex Williamson, Cornelia Huck, kvm, Andrew Morton, linux-mm,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, Liran Alon, Nikita Leshenko, Barret Rhoden,
	Boris Ostrovsky, Matthew Wilcox, Konrad Rzeszutek Wilk

User can define regions with 'memmap=size!offset' which in turn
creates PMEM legacy devices. But because it is a label-less
NVDIMM device we only have one namespace for the whole device.

Add support for multiple namespaces by adding ndctl control
support, and exposing a minimal set of features:
(ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
store labels.

Initialization is a little different: We allocate and register an
nvdimm bus with an @nvdimm_descriptor which we use to locate
where we are keeping our label storage area. The config data
get/set/size operations are then simply memcpying to this area.

Equivalent approach can also be found in the NFIT tests which
emulate the same thing.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/nvdimm/e820.c | 212 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 191 insertions(+), 21 deletions(-)

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index e02f60ad6c99..36fbff3d7110 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -7,14 +7,21 @@
 #include <linux/memory_hotplug.h>
 #include <linux/libnvdimm.h>
 #include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/ndctl.h>
+#include <linux/nd.h>
 
-static int e820_pmem_remove(struct platform_device *pdev)
-{
-	struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
+#define LABEL_SIZE SZ_128K
 
-	nvdimm_bus_unregister(nvdimm_bus);
-	return 0;
-}
+struct e820_descriptor {
+	struct nd_interleave_set nd_set;
+	struct nvdimm_bus_descriptor nd_desc;
+	void *label;
+	unsigned char cookie1[16];
+	unsigned char cookie2[16];
+	struct nvdimm_bus *nvdimm_bus;
+	struct nvdimm *nvdimm;
+};
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 static int e820_range_to_nid(resource_size_t addr)
@@ -28,43 +35,206 @@ static int e820_range_to_nid(resource_size_t addr)
 }
 #endif
 
+static int e820_get_config_size(struct nd_cmd_get_config_size *nd_cmd,
+				unsigned int buf_len)
+{
+	if (buf_len < sizeof(*nd_cmd))
+		return -EINVAL;
+
+	nd_cmd->status = 0;
+	nd_cmd->config_size = LABEL_SIZE;
+	nd_cmd->max_xfer = SZ_4K;
+
+	return 0;
+}
+
+static int e820_get_config_data(struct nd_cmd_get_config_data_hdr
+		*nd_cmd, unsigned int buf_len, void *label)
+{
+	unsigned int len, offset = nd_cmd->in_offset;
+	int rc;
+
+	if (buf_len < sizeof(*nd_cmd))
+		return -EINVAL;
+	if (offset >= LABEL_SIZE)
+		return -EINVAL;
+	if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+		return -EINVAL;
+
+	nd_cmd->status = 0;
+	len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+	memcpy(nd_cmd->out_buf, label + offset, len);
+	rc = buf_len - sizeof(*nd_cmd) - len;
+
+	return rc;
+}
+
+static int e820_set_config_data(struct nd_cmd_set_config_hdr *nd_cmd,
+		unsigned int buf_len, void *label)
+{
+	unsigned int len, offset = nd_cmd->in_offset;
+	u32 *status;
+	int rc;
+
+	if (buf_len < sizeof(*nd_cmd))
+		return -EINVAL;
+	if (offset >= LABEL_SIZE)
+		return -EINVAL;
+	if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+		return -EINVAL;
+
+	status = (void *)nd_cmd + nd_cmd->in_length + sizeof(*nd_cmd);
+	*status = 0;
+	len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+	memcpy(label + offset, nd_cmd->in_buf, len);
+	rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+
+	return rc;
+}
+
+static struct e820_descriptor *to_e820_desc(struct nvdimm_bus_descriptor *desc)
+{
+	return container_of(desc, struct e820_descriptor, nd_desc);
+}
+
+static int e820_ndctl(struct nvdimm_bus_descriptor *nd_desc,
+			 struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+			 unsigned int buf_len, int *cmd_rc)
+{
+	struct e820_descriptor *t = to_e820_desc(nd_desc);
+	int rc = -EINVAL;
+
+	switch (cmd) {
+	case ND_CMD_GET_CONFIG_SIZE:
+		rc = e820_get_config_size(buf, buf_len);
+		break;
+	case ND_CMD_GET_CONFIG_DATA:
+		rc = e820_get_config_data(buf, buf_len, t->label);
+		break;
+	case ND_CMD_SET_CONFIG_DATA:
+		rc = e820_set_config_data(buf, buf_len, t->label);
+		break;
+	default:
+		return rc;
+	}
+
+	return rc;
+}
+
+static void e820_desc_free(struct e820_descriptor *desc)
+{
+	if (!desc)
+		return;
+
+	nvdimm_bus_unregister(desc->nvdimm_bus);
+	kfree(desc->label);
+	kfree(desc);
+}
+
+static struct e820_descriptor *e820_desc_alloc(struct platform_device *pdev)
+{
+	struct nvdimm_bus_descriptor *nd_desc;
+	unsigned int cmd_mask, dimm_flags;
+	struct device *dev = &pdev->dev;
+	struct nvdimm_bus *nvdimm_bus;
+	struct e820_descriptor *desc;
+	struct nvdimm *nvdimm;
+
+	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+	if (!desc)
+		goto err;
+
+	desc->label = kzalloc(LABEL_SIZE, GFP_KERNEL);
+	if (!desc->label)
+		goto err;
+
+	nd_desc = &desc->nd_desc;
+	nd_desc->provider_name = "e820";
+	nd_desc->module = THIS_MODULE;
+	nd_desc->ndctl = e820_ndctl;
+	nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
+	if (!nvdimm_bus) {
+		dev_err(dev, "nvdimm bus registration failure\n");
+		goto err;
+	}
+	desc->nvdimm_bus = nvdimm_bus;
+
+	cmd_mask = (1UL << ND_CMD_GET_CONFIG_SIZE |
+			1UL << ND_CMD_GET_CONFIG_DATA |
+			1UL << ND_CMD_SET_CONFIG_DATA);
+	dimm_flags = (1UL << NDD_ALIASING);
+	nvdimm = nvdimm_create(nvdimm_bus, pdev, NULL,
+				dimm_flags, cmd_mask, 0, NULL);
+	if (!nvdimm) {
+		dev_err(dev, "nvdimm creation failure\n");
+		goto err;
+	}
+	desc->nvdimm = nvdimm;
+	return desc;
+
+err:
+	e820_desc_free(desc);
+	return NULL;
+}
+
 static int e820_register_one(struct resource *res, void *data)
 {
+	struct platform_device *pdev = data;
 	struct nd_region_desc ndr_desc;
-	struct nvdimm_bus *nvdimm_bus = data;
+	struct nd_mapping_desc mapping;
+	struct e820_descriptor *desc;
+
+	desc = e820_desc_alloc(pdev);
+	if (!desc)
+		return -ENOMEM;
+
+	mapping.nvdimm = desc->nvdimm;
+	mapping.start = res->start;
+	mapping.size = resource_size(res);
+	mapping.position = 0;
+
+	generate_random_uuid(desc->cookie1);
+	desc->nd_set.cookie1 = (u64) desc->cookie1;
+	generate_random_uuid(desc->cookie2);
+	desc->nd_set.cookie2 = (u64) desc->cookie2;
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	ndr_desc.res = res;
 	ndr_desc.numa_node = e820_range_to_nid(res->start);
 	ndr_desc.target_node = ndr_desc.numa_node;
+	ndr_desc.mapping = &mapping;
+	ndr_desc.num_mappings = 1;
+	ndr_desc.nd_set = &desc->nd_set;
 	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
-	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+	if (!nvdimm_pmem_region_create(desc->nvdimm_bus, &ndr_desc)) {
+		e820_desc_free(desc);
+		dev_err(&pdev->dev, "nvdimm region creation failure\n");
 		return -ENXIO;
+	}
+
+	platform_set_drvdata(pdev, desc);
+	return 0;
+}
+
+static int e820_pmem_remove(struct platform_device *pdev)
+{
+	struct e820_descriptor *desc = platform_get_drvdata(pdev);
+
+	e820_desc_free(desc);
 	return 0;
 }
 
 static int e820_pmem_probe(struct platform_device *pdev)
 {
-	static struct nvdimm_bus_descriptor nd_desc;
-	struct device *dev = &pdev->dev;
-	struct nvdimm_bus *nvdimm_bus;
 	int rc = -ENXIO;
 
-	nd_desc.provider_name = "e820";
-	nd_desc.module = THIS_MODULE;
-	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
-	if (!nvdimm_bus)
-		goto err;
-	platform_set_drvdata(pdev, nvdimm_bus);
-
 	rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
-			IORESOURCE_MEM, 0, -1, nvdimm_bus, e820_register_one);
+			IORESOURCE_MEM, 0, -1, pdev, e820_register_one);
 	if (rc)
 		goto err;
 	return 0;
 err:
-	nvdimm_bus_unregister(nvdimm_bus);
-	dev_err(dev, "failed to register legacy persistent memory ranges\n");
+	dev_err(&pdev->dev, "failed to register legacy persistent memory ranges\n");
 	return rc;
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-10 19:03 [PATCH RFC 00/10] device-dax: Support devices without PFN metadata Joao Martins
2020-01-10 19:03 ` [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL Joao Martins
2020-01-10 19:03 ` [PATCH RFC 02/10] mm: Handle pmd entries in follow_pfn() Joao Martins
2020-01-10 19:03 ` [PATCH RFC 03/10] mm: Add pud support for _PAGE_SPECIAL Joao Martins
2020-01-10 19:03 ` [PATCH RFC 04/10] mm: Handle pud entries in follow_pfn() Joao Martins
2020-01-10 19:03 ` [PATCH RFC 05/10] device-dax: Do not enforce MADV_DONTFORK on mmap() Joao Martins
2020-01-10 19:03 ` [PATCH RFC 06/10] device-dax: Introduce pfn_flags helper Joao Martins
2020-01-10 19:03 ` [PATCH RFC 07/10] device-dax: Add support for PFN_SPECIAL flags Joao Martins
2020-01-10 19:03 ` [PATCH RFC 08/10] dax/pmem: Add device-dax support for PFN_MODE_NONE Joao Martins
2020-01-10 19:03 ` [PATCH RFC 09/10] vfio/type1: Use follow_pfn for VM_FPNMAP VMAs Joao Martins
2020-01-10 19:03 ` [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support Joao Martins

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
		kvm@vger.kernel.org
	public-inbox-index kvm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.kvm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git