KVM Archive on lore.kernel.org
From: Joao Martins <joao.m.martins@oracle.com>
To: linux-nvdimm@lists.01.org
Cc: Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	kvm@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H . Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Liran Alon <liran.alon@oracle.com>,
	Nikita Leshenko <nikita.leshchenko@oracle.com>,
	Barret Rhoden <brho@google.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [PATCH RFC 00/10] device-dax: Support devices without PFN metadata
Date: Fri, 10 Jan 2020 19:03:03 +0000
Message-ID: <20200110190313.17144-1-joao.m.martins@oracle.com> (raw)


This is a small series that allows device-dax to work without struct page,
so that it can be used to back KVM guest memory. It's an RFC, and there are
still some items we're looking at (see TODO below), but I'm wondering if folks
would be OK carving some time out of their busy schedules to provide
direction-wise feedback on this work.

In virtualized environments (especially those with no kernel-backed PV
interfaces, and just SR-IOV), memory is largely assigned to guests: either
persistent with NVDIMMs or volatile for regular RAM. The kernel
(hypervisor) tracks it with a 'struct page' (64 bytes) for each 4K page. Overall
we're spending 16GB for each 1TB of host memory on tracking that the kernel
won't need, and which could instead be used to create other guests. One of the
motivations of this series is to get back the memory used for 'struct page' when
the memory it tracks is meant to be used solely by userspace. This is also
useful for the case of memory backing guests' virtual NVDIMMs. The other neat
side effect is that the hypervisor has no virtual mapping of the guest and hence
code gadgets (if found) are limited in their effectiveness.
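
As a rough check of the 16GB-per-1TB figure above, the arithmetic can be
sketched as follows. This is only an illustration and not part of the series;
the helper name and the STRUCT_PAGE_BYTES constant are made up here, and the
64-byte size is the one quoted in this letter.

```c
#include <assert.h>
#include <stdint.h>

/* Size of 'struct page' as quoted in the cover letter (assumption: 64 bytes). */
#define STRUCT_PAGE_BYTES 64ULL

/*
 * Bytes of 'struct page' metadata needed to track mem_bytes of memory
 * at the given base page size. Hypothetical helper, for illustration only.
 */
uint64_t struct_page_overhead(uint64_t mem_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return (mem_bytes / page_size) * STRUCT_PAGE_BYTES;
}
```

For 1TB of memory and 4K pages, struct_page_overhead(1ULL << 40, 4096)
evaluates to 16ULL << 30, i.e. 16GB of metadata.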

It is expected that a smaller (instead of total) amount of host memory is
defined for the kernel (with mem=X or memmap=X!Y). For a KVM userspace VMM (e.g.
QEMU), the main thing that is needed is a device which supports open + mmap +
close with a certain alignment (4K, 2M, 1G). That made us look at device-dax,
which does just that, and so the work comprised here is improving what's there
and the interfaces it uses.
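
To make the open + mmap + close usage concrete, here is a minimal userspace
sketch of what a VMM would roughly do with such a device. It is not taken from
QEMU or from this series; the function name is hypothetical, and requiring the
length to be a multiple of the alignment is an assumption made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map 'len' bytes of the device at 'path' (e.g. /dev/dax0.0) shared and
 * read-write. 'len' must be a non-zero multiple of 'align' (4K, 2M or 1G).
 * Returns MAP_FAILED on error. Hypothetical helper, for illustration only.
 */
void *map_guest_memory(const char *path, size_t len, size_t align)
{
	void *addr;
	int fd;

	/* The alignment must be a power of two. */
	assert(align != 0 && (align & (align - 1)) == 0);
	if (len == 0 || (len & (align - 1)) != 0)
		return MAP_FAILED;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping remains valid after the descriptor is closed. */
	close(fd);
	return addr;
}
```

With the example setup from the P.S. this would be called as
map_guest_memory("/dev/dax0.0", (size_t)32 << 30, (size_t)1 << 30), and the
region would later be released with munmap().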

The series is divided as follows:

 * Patch 1, 3: Preparatory work for patch 7, adding support for
               vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL.
 * Patch 2, 4: Preparatory work for patch 7, adding support for
               follow_pfn() to work with 2M/1G huge pages, which is
               what KVM uses for VM_PFNMAP.

 * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
                which mainly encompasses dealing with the lack of devmap
                and creating a VM_PFNMAP vma.

 * Patch 8: PMEM support for no PFN metadata, only for device-dax namespaces.
            At the very end of the cover letter (after the scissors mark),
            there's a patch for ndctl to be able to create namespaces
            with '--mode devdax --map none'.

 * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
            a PFN.

 * Patch 10: The actual end consumer example for the RAM case. The patch just
             adds a label storage area, which consequently allows namespaces
             to be created. We picked PMEM legacy for starters.
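
For readers less familiar with the 2M/1G lookups mentioned for patches 2 and 4:
resolving an address inside a huge mapping to a 4K PFN is simple arithmetic.
The sketch below is a userspace illustration of that arithmetic only, not the
code in the patches; the function name and BASE_PAGE_SHIFT are made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Base page shift for 4K pages (assumption: x86-64). */
#define BASE_PAGE_SHIFT 12

/*
 * PFN of the 4K page that backs 'addr', given the first PFN of the huge
 * mapping that covers it and the size of that mapping (2M or 1G).
 * Hypothetical helper, for illustration only.
 */
uint64_t pfn_in_huge_mapping(uint64_t base_pfn, uint64_t addr,
			     uint64_t map_size)
{
	/* The mapping size must be a power of two. */
	assert(map_size != 0 && (map_size & (map_size - 1)) == 0);
	return base_pfn + ((addr & (map_size - 1)) >> BASE_PAGE_SHIFT);
}
```

An address 0x1000 bytes into a 2M mapping whose first PFN is 0x400000 resolves
to PFN 0x400001.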

Thoughts and comments appreciated.


P.S. As an example to try this out:

 1) add 'memmap=48G!16G' to the kernel command line on a host with 64G; this
 marks the 48G starting at offset 16G as persistent memory and leaves the
 kernel with 16G.

 2) create a devdax namespace with 1G hugepages: 

 $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
  "size":"32.00 GiB (34.36 GB)",
    "size":"32.00 GiB (34.36 GB)",
        "size":"32.00 GiB (34.36 GB)",

 3) Add this to your qemu params:
  -m 32G 
  -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
  -numa node,memdev=mem
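
For clarity on step 1, 'memmap=nn[KMG]!ss[KMG]' marks the range from ss to
ss+nn. The sketch below splits such a string into start and size to show how
the 64G host ends up with 16G for the kernel and 48G for the namespace. It is
a simplified stand-in written for this example, not the kernel's parser, and
all names in it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct pmem_range {
	uint64_t start;	/* first byte of the marked range */
	uint64_t size;	/* length of the marked range in bytes */
	int valid;	/* non-zero if the string was well formed */
};

/* Parse a size such as "48G"; advances *s past the consumed characters. */
uint64_t parse_size(const char **s)
{
	char *end;
	uint64_t val = strtoull(*s, &end, 0);
	unsigned int shift = 0;

	switch (*end) {
	case 'G': case 'g': shift = 30; break;
	case 'M': case 'm': shift = 20; break;
	case 'K': case 'k': shift = 10; break;
	default: break;
	}
	if (shift)
		end++;
	*s = end;
	return val << shift;
}

/* Split "nn[KMG]!ss[KMG]" into the start and size of the marked range. */
struct pmem_range parse_memmap_pmem(const char *arg)
{
	struct pmem_range r = { 0, 0, 0 };

	assert(arg != NULL);
	r.size = parse_size(&arg);
	if (*arg != '!')
		return r;
	arg++;
	r.start = parse_size(&arg);
	r.valid = (*arg == '\0');
	return r;
}
```

For "48G!16G" this yields start = 16G and size = 48G, so the range ends at 64G,
matching the host in the example.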


TODO:

 * Discontiguous regions/namespaces: The work above is limited to the max
contiguous extent, coming from nvdimm DPA allocation heuristics -- which I take
is because of what specs allow for persistent namespaces. But for the volatile
RAM case we would need handling of discontiguous extents (hence a region would
represent more than one resource) to be less bound to how guests are placed on
the system. I played around with multi-resource for device-dax, but I'm
wondering about UABI: 1) whether nvdimm DPA allocation heuristics should be
relaxed for the RAM case (under certain nvdimm region bits); or 2) whether
device-dax should have its own separate UABI to be used by daxctl (which would
also be useful for hmem

 * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
device-dax, which would allow collect_procs() to find the processes solely based
on the PFN. But for discontiguous namespaces I'm not sure this would work;
perhaps it could be done by looking at the dax-region pfn range for each DAX vma.

 * NUMA: Setting the target_node is excluded for now, while these two patches
 are being worked on [1][2].

 [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
 [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/

Joao Martins (9):
  mm: Add pmd support for _PAGE_SPECIAL
  mm: Handle pmd entries in follow_pfn()
  mm: Add pud support for _PAGE_SPECIAL
  mm: Handle pud entries in follow_pfn()
  device-dax: Do not enforce MADV_DONTFORK on mmap()
  device-dax: Introduce pfn_flags helper
  device-dax: Add support for PFN_SPECIAL flags
  dax/pmem: Add device-dax support for PFN_MODE_NONE
  nvdimm/e820: add multiple namespaces support

Nikita Leshenko (1):
  vfio/type1: Use follow_pfn for VM_PFNMAP VMAs

 arch/x86/include/asm/pgtable.h  |  34 ++++-
 drivers/dax/bus.c               |   3 +-
 drivers/dax/device.c            |  78 ++++++++----
 drivers/dax/pmem/core.c         |  36 +++++-
 drivers/nvdimm/e820.c           | 212 ++++++++++++++++++++++++++++----
 drivers/vfio/vfio_iommu_type1.c |   6 +-
 mm/gup.c                        |   6 +
 mm/huge_memory.c                |  15 ++-
 mm/memory.c                     |  67 ++++++++--
 9 files changed, 382 insertions(+), 75 deletions(-)

-- >8 --
Subject: [PATCH] ndctl: add 'devdax' support for NDCTL_PFN_LOC_NONE

diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 7fb00078646b..2568943eb207 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -206,6 +206,8 @@ static int set_defaults(enum device_action mode)
 			/* pass */;
 		else if (strcmp(param.map, "dev") == 0)
 			/* pass */;
+		else if (strcmp(param.map, "none") == 0)
+			/* pass */;
 		else {
 			error("invalid map location '%s'\n", param.map);
 			rc = -EINVAL;
@@ -755,9 +757,17 @@ static int validate_namespace_options(struct ndctl_region *region,
 	if (param.map) {
 		if (!strcmp(param.map, "mem"))
 			p->loc = NDCTL_PFN_LOC_RAM;
+		else if (!strcmp(param.map, "none"))
+			p->loc = NDCTL_PFN_LOC_NONE;
 		else
 			p->loc = NDCTL_PFN_LOC_PMEM;
 
+		if (p->loc == NDCTL_PFN_LOC_NONE
+			&& p->mode != NDCTL_NS_MODE_DAX) {
+			debug("--map=none only valid for devdax mode namespace\n");
+			return -EINVAL;
+		}
+
 		if (ndns && p->mode != NDCTL_NS_MODE_MEMORY
 			&& p->mode != NDCTL_NS_MODE_DAX) {
 			debug("%s: --map= only valid for fsdax mode namespace\n",

