nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [RFC 0/2] KVM "fake DAX" device flushing
@ 2017-10-12 15:50 Pankaj Gupta
  2017-10-12 15:50 ` [RFC 1/2] pmem: Move reusable code to base header files Pankaj Gupta
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 15:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric,
	pagupta

We are sharing the prototype version of 'fake DAX' flushing
interface for the initial feedback. This is still work in progress
and not yet ready for merging.

Prototype right now just implements basic functionality without advanced
features with two major parts:

- Qemu virtio-pmem device
  It exposes a persistent memory range to KVM guest which at host side is file
  backed memory and works as persistent memory device. In addition to this it
  provides a virtio flushing interface for KVM guest to do a Qemu side sync for
  guest DAX persistent memory range.  

- Guest virtio-pmem driver
  Reads persistent memory range from paravirt device and reserves system memory map.
  It also allocates a block device corresponding to the pmem range which is accessed
  by DAX capable file systems. (file system support is still pending).  
  
We shared the project idea for 'fake DAX' flushing interface here [1].
Based on suggestions here [2], we implemented guest 'virtio-pmem'
driver and Qemu paravirt device.

[1] https://www.spinics.net/lists/kvm/msg149761.html
[2] https://www.spinics.net/lists/kvm/msg153095.html

Work yet to be done:

- Separate out the common code used by ACPI pmem interface and
  reuse it.

- In pmem device memmap allocation and working. There is some parallel work
  going on upstream related to 'memory_hotplug restructuring' [3] and also hitting
  a memory section alignment issue [4].
  
  [3] https://lwn.net/Articles/712099/
  [4] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html
  
- Provide DAX capable file-system(ext4 & XFS) support.
- Qemu device flush functionality trigger with guest fsync on file.
- Qemu live migration work when host page cache is used.
- Multiple virtio-pmem disks support.
- Prepare virtio spec after we get feedback on current approach.

 drivers/nvdimm/pfn.h             |   14 -
 drivers/nvdimm/pfn_devs.c        |   20 --
 drivers/nvdimm/pmem.c            |   40 ----
 drivers/nvdimm/pmem.h            |    5 
 drivers/virtio/Kconfig           |   10 +
 drivers/virtio/Makefile          |    1 
 drivers/virtio/virtio_pmem.c     |  322 +++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h         |   23 ++
 include/linux/pfn.h              |   15 +
 include/linux/pmem_common.h      |   52 ++++++
 include/uapi/linux/virtio_pmem.h |   55 ++++++
 11 files changed, 479 insertions(+), 78 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC 1/2] pmem: Move reusable code to base header files
  2017-10-12 15:50 [RFC 0/2] KVM "fake DAX" device flushing Pankaj Gupta
@ 2017-10-12 15:50 ` Pankaj Gupta
  2017-10-12 20:42   ` Dan Williams
       [not found] ` <20171012155027.3277-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-10-12 15:50 ` [RFC] QEMU: Add virtio pmem device Pankaj Gupta
  2 siblings, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 15:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric,
	pagupta

 This patch moves common code to base header files
 so that it can be used for both ACPI pmem and VIRTIO pmem
 drivers. More common code needs to be moved out in future
 based on functionality required for virtio_pmem driver and 
 coupling of code with existing ACPI pmem driver.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/nvdimm/pfn.h        | 14 ------------
 drivers/nvdimm/pfn_devs.c   | 20 -----------------
 drivers/nvdimm/pmem.c       | 40 ----------------------------------
 drivers/nvdimm/pmem.h       |  5 +----
 include/linux/memremap.h    | 23 ++++++++++++++++++++
 include/linux/pfn.h         | 15 +++++++++++++
 include/linux/pmem_common.h | 52 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 91 insertions(+), 78 deletions(-)
 create mode 100644 include/linux/pmem_common.h

diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index dde9853453d3..1a853f651faf 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -40,18 +40,4 @@ struct nd_pfn_sb {
 	__le64 checksum;
 };
 
-#ifdef CONFIG_SPARSEMEM
-#define PFN_SECTION_ALIGN_DOWN(x) SECTION_ALIGN_DOWN(x)
-#define PFN_SECTION_ALIGN_UP(x) SECTION_ALIGN_UP(x)
-#else
-/*
- * In this case ZONE_DEVICE=n and we will disable 'pfn' device support,
- * but we still want pmem to compile.
- */
-#define PFN_SECTION_ALIGN_DOWN(x) (x)
-#define PFN_SECTION_ALIGN_UP(x) (x)
-#endif
-
-#define PHYS_SECTION_ALIGN_DOWN(x) PFN_PHYS(PFN_SECTION_ALIGN_DOWN(PHYS_PFN(x)))
-#define PHYS_SECTION_ALIGN_UP(x) PFN_PHYS(PFN_SECTION_ALIGN_UP(PHYS_PFN(x)))
 #endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 9576c444f0ab..52d6923e92fc 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -513,26 +513,6 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
 }
 EXPORT_SYMBOL(nd_pfn_probe);
 
-/*
- * We hotplug memory at section granularity, pad the reserved area from
- * the previous section base to the namespace base address.
- */
-static unsigned long init_altmap_base(resource_size_t base)
-{
-	unsigned long base_pfn = PHYS_PFN(base);
-
-	return PFN_SECTION_ALIGN_DOWN(base_pfn);
-}
-
-static unsigned long init_altmap_reserve(resource_size_t base)
-{
-	unsigned long reserve = PHYS_PFN(SZ_8K);
-	unsigned long base_pfn = PHYS_PFN(base);
-
-	reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);
-	return reserve;
-}
-
 static struct vmem_altmap *__nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
 		struct resource *res, struct vmem_altmap *altmap)
 {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 39dfd7affa31..5075131b715b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -77,46 +77,6 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	return rc;
 }
 
-static void write_pmem(void *pmem_addr, struct page *page,
-		unsigned int off, unsigned int len)
-{
-	unsigned int chunk;
-	void *mem;
-
-	while (len) {
-		mem = kmap_atomic(page);
-		chunk = min_t(unsigned int, len, PAGE_SIZE);
-		memcpy_flushcache(pmem_addr, mem + off, chunk);
-		kunmap_atomic(mem);
-		len -= chunk;
-		off = 0;
-		page++;
-		pmem_addr += PAGE_SIZE;
-	}
-}
-
-static blk_status_t read_pmem(struct page *page, unsigned int off,
-		void *pmem_addr, unsigned int len)
-{
-	unsigned int chunk;
-	int rc;
-	void *mem;
-
-	while (len) {
-		mem = kmap_atomic(page);
-		chunk = min_t(unsigned int, len, PAGE_SIZE);
-		rc = memcpy_mcsafe(mem + off, pmem_addr, chunk);
-		kunmap_atomic(mem);
-		if (rc)
-			return BLK_STS_IOERR;
-		len -= chunk;
-		off = 0;
-		page++;
-		pmem_addr += PAGE_SIZE;
-	}
-	return BLK_STS_OK;
-}
-
 static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			unsigned int len, unsigned int off, bool is_write,
 			sector_t sector)
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index c5917f040fa7..8c5620614ec0 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,9 +1,6 @@
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
-#include <linux/badblocks.h>
-#include <linux/types.h>
-#include <linux/pfn_t.h>
-#include <linux/fs.h>
+#include <linux/pmem_common.h>
 
 /* this definition is in it's own header for tools/testing/nvdimm to consume */
 struct pmem_device {
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79f8ba7c3894..e4eb81020306 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -3,12 +3,35 @@
 #include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
+#include <linux/sizes.h>
+#include <linux/pfn.h>
 
 #include <asm/pgtable.h>
 
 struct resource;
 struct device;
 
+/*
+ * We hotplug memory at section granularity, pad the reserved area from
+ * the previous section base to the namespace base address.
+ */
+static inline unsigned long init_altmap_base(resource_size_t base)
+{
+	unsigned long base_pfn = PHYS_PFN(base);
+
+	return PFN_SECTION_ALIGN_DOWN(base_pfn);
+}
+
+static inline unsigned long init_altmap_reserve(resource_size_t base)
+{
+	unsigned long reserve = PHYS_PFN(SZ_8K);
+	unsigned long base_pfn = PHYS_PFN(base);
+
+	reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);
+	return reserve;
+}
+
+
 /**
  * struct vmem_altmap - pre-allocated storage for vmemmap_populate
  * @base_pfn: base of the entire dev_pagemap mapping
diff --git a/include/linux/pfn.h b/include/linux/pfn.h
index 1132953235c0..2d8f69cc1470 100644
--- a/include/linux/pfn.h
+++ b/include/linux/pfn.h
@@ -20,4 +20,19 @@ typedef struct {
 #define PFN_PHYS(x)	((phys_addr_t)(x) << PAGE_SHIFT)
 #define PHYS_PFN(x)	((unsigned long)((x) >> PAGE_SHIFT))
 
+#ifdef CONFIG_SPARSEMEM
+#define PFN_SECTION_ALIGN_DOWN(x) SECTION_ALIGN_DOWN(x)
+#define PFN_SECTION_ALIGN_UP(x) SECTION_ALIGN_UP(x)
+#else
+/*
+ * In this case ZONE_DEVICE=n and we will disable 'pfn' device support,
+ * but we still want pmem to compile.
+ */
+#define PFN_SECTION_ALIGN_DOWN(x) (x)
+#define PFN_SECTION_ALIGN_UP(x) (x)
+#endif
+
+#define PHYS_SECTION_ALIGN_DOWN(x) PFN_PHYS(PFN_SECTION_ALIGN_DOWN(PHYS_PFN(x)))
+#define PHYS_SECTION_ALIGN_UP(x) PFN_PHYS(PFN_SECTION_ALIGN_UP(PHYS_PFN(x)))
+
 #endif
diff --git a/include/linux/pmem_common.h b/include/linux/pmem_common.h
new file mode 100644
index 000000000000..e2e718c74b3f
--- /dev/null
+++ b/include/linux/pmem_common.h
@@ -0,0 +1,52 @@
+#ifndef __PMEM_COMMON_H__
+#define __PMEM_COMMON_H__
+
+#include <linux/badblocks.h>
+#include <linux/types.h>
+#include <linux/pfn_t.h>
+#include <linux/fs.h>
+#include <linux/pfn_t.h>
+#include <linux/memremap.h>
+#include <linux/vmalloc.h>
+#include <linux/mmzone.h>
+#include <linux/dax.h>
+#include <linux/highmem.h>
+#include <linux/blkdev.h>
+
+static void write_pmem(void *pmem_addr, struct page *page,
+	unsigned int off, unsigned int len)
+{
+	void *mem = kmap_atomic(page);
+
+	memcpy_flushcache(pmem_addr, mem + off, len);
+	kunmap_atomic(mem);
+}
+
+static blk_status_t read_pmem(struct page *page, unsigned int off,
+	void *pmem_addr, unsigned int len)
+{
+	int rc;
+	void *mem = kmap_atomic(page);
+
+	rc = memcpy_mcsafe(mem + off, pmem_addr, len);
+	kunmap_atomic(mem);
+	if (rc)
+		return BLK_STS_IOERR;
+	return BLK_STS_OK;
+}
+
+#endif /* __PMEM_COMMON_H__ */
+
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
+void arch_wb_cache_pmem(void *addr, size_t size);
+void arch_invalidate_pmem(void *addr, size_t size);
+#else
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+}
+#endif
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC 2/2] KVM: add virtio-pmem driver
       [not found] ` <20171012155027.3277-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-12 15:50   ` Pankaj Gupta
  2017-10-12 20:51     ` Dan Williams
                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 15:50 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
	jack-AlSwsSmVLrQ, xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w,
	david-H+wXaHxf7aLQT0dZR+AlfA, pagupta-H+wXaHxf7aLQT0dZR+AlfA,
	ross.zwisler-ral2JQCrhuEAvxtiuMwx3w,
	stefanha-H+wXaHxf7aLQT0dZR+AlfA, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	nilal-H+wXaHxf7aLQT0dZR+AlfA

  This patch adds virtio-pmem driver for KVM guest.
  Guest reads the persistent memory range information
  over virtio bus from Qemu and reserves the range
  as persistent memory. Guest also allocates a block
  device corresponding to the pmem range which later
  can be accessed with DAX compatible file systems.
  Idea is to use the virtio channel between guest and
  host to perform the block device flush for guest pmem 
  DAX device.

  There is work to do including DAX file system support 
  and other advanced features.

Signed-off-by: Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/virtio/Kconfig           |  10 ++
 drivers/virtio/Makefile          |   1 +
 drivers/virtio/virtio_pmem.c     | 322 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_pmem.h |  55 +++++++
 4 files changed, 388 insertions(+)
 create mode 100644 drivers/virtio/virtio_pmem.c
 create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cff773f15b7e..0192c4bda54b 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
 
 	  If unsure, say Y.
 
+config VIRTIO_PMEM
+	tristate "Virtio pmem driver"
+	depends on VIRTIO
+	---help---
+	 This driver adds persistent memory range within a KVM guest.
+         It also associates a block device corresponding to the pmem
+	 range.
+
+	 If unsure, say M.
+
 config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 41e30e3dc842..032ade725cc2 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -5,3 +5,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
new file mode 100644
index 000000000000..74e47cae0e24
--- /dev/null
+++ b/drivers/virtio/virtio_pmem.c
@@ -0,0 +1,322 @@
+/*
+ * virtio-pmem driver
+ */
+
+#include <linux/virtio.h>
+#include <linux/swap.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/oom.h>
+#include <linux/wait.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/magic.h>
+#include <linux/virtio_pmem.h>
+
+void devm_vpmem_disable(struct device *dev, struct resource *res, void *addr)
+{
+	devm_memunmap(dev, addr);
+	devm_release_mem_region(dev, res->start, resource_size(res));
+}
+
+static void pmem_flush_done(struct virtqueue *vq)
+{
+	return;
+};
+
+static void virtio_pmem_release_queue(void *q)
+{
+	blk_cleanup_queue(q);
+}
+
+static void virtio_pmem_freeze_queue(void *q)
+{
+	blk_freeze_queue_start(q);
+}
+
+static void virtio_pmem_release_disk(void *__pmem)
+{
+	struct virtio_pmem *pmem = __pmem;
+
+	del_gendisk(pmem->disk);
+	put_disk(pmem->disk);
+}
+
+static int init_vq(struct virtio_pmem *vpmem)
+{
+	struct virtqueue *vq;
+
+	/* single vq */
+	vq = virtio_find_single_vq(vpmem->vdev, pmem_flush_done, "flush_queue");
+
+	if (IS_ERR(vq))
+		return PTR_ERR(vq);
+
+	return 0;
+}
+
+static struct vmem_altmap *setup_pmem_pfn(struct virtio_pmem *vpmem,
+			struct resource *res, struct vmem_altmap *altmap)
+{
+	u32 start_pad = 0, end_trunc = 0;
+	resource_size_t start, size;
+	unsigned long npfns;
+	phys_addr_t offset;
+
+	size = resource_size(res);
+	start = PHYS_SECTION_ALIGN_DOWN(res->start);
+
+	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
+		IORES_DESC_NONE) == REGION_MIXED) {
+
+		start = res->start;
+		start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
+	}
+	start = res->start;
+	size = PHYS_SECTION_ALIGN_UP(start + size) - start;
+
+	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
+		IORES_DESC_NONE) == REGION_MIXED) {
+
+		size = resource_size(res);
+		end_trunc = start + size -
+				PHYS_SECTION_ALIGN_DOWN(start + size);
+	}
+
+	start += start_pad;
+	size = resource_size(res);
+	npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - SZ_8K)
+						/ PAGE_SIZE);
+
+      /*
+       * vmemmap_populate_hugepages() allocates the memmap array in
+       * HPAGE_SIZE chunks.
+       */
+	offset = ALIGN(start + SZ_8K + 64 * npfns, HPAGE_SIZE) - start;
+	vpmem->data_offset = offset;
+
+	struct vmem_altmap __altmap = {
+		.base_pfn = init_altmap_base(start+start_pad),
+		.reserve = init_altmap_reserve(start+start_pad),
+	};
+
+	res->start += start_pad;
+	res->end -= end_trunc;
+	memcpy(altmap, &__altmap, sizeof(*altmap));
+	altmap->free = PHYS_PFN(offset - SZ_8K);
+	altmap->alloc = 0;
+
+	return altmap;
+}
+
+static blk_status_t pmem_do_bvec(struct virtio_pmem *pmem, struct page *page,
+			unsigned int len, unsigned int off, bool is_write,
+			sector_t sector)
+{
+	blk_status_t rc = BLK_STS_OK;
+	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+	void *pmem_addr = pmem->virt_addr + pmem_off;
+
+	if (!is_write) {
+		rc = read_pmem(page, off, pmem_addr, len);
+			flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		write_pmem(pmem_addr, page, off, len);
+	}
+
+	return rc;
+}
+
+static int vpmem_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, bool is_write)
+{
+	struct virtio_pmem  *pmem = bdev->bd_queue->queuedata;
+	blk_status_t rc;
+
+	rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
+			  0, is_write, sector);
+
+	if (rc == 0)
+		page_endio(page, is_write, 0);
+
+	return blk_status_to_errno(rc);
+}
+
+#ifndef REQ_FLUSH
+#define REQ_FLUSH REQ_PREFLUSH
+#endif
+
+static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
+			struct bio *bio)
+{
+	blk_status_t rc = 0;
+	struct bio_vec bvec;
+	struct bvec_iter iter;
+	struct virtio_pmem *pmem = q->queuedata;
+
+	if (bio->bi_opf & REQ_FLUSH)
+		//todo host flush command
+
+	bio_for_each_segment(bvec, bio, iter) {
+		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
+				bvec.bv_offset, op_is_write(bio_op(bio)),
+				iter.bi_sector);
+		if (rc) {
+			bio->bi_status = rc;
+			break;
+		}
+	}
+
+	bio_endio(bio);
+	return BLK_QC_T_NONE;
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		vpmem_rw_page,
+	//.revalidate_disk =	nvdimm_revalidate_disk,
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+	struct virtio_pmem *vpmem;
+	int err = 0;
+	void *addr;
+	struct resource *res, res_pfn;
+	struct request_queue *q;
+	struct vmem_altmap __altmap, *altmap = NULL;
+	struct gendisk *disk;
+	struct device *gendev;
+	int nid = dev_to_node(&vdev->dev);
+
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+			GFP_KERNEL);
+
+	if (!vpmem) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	dev_set_drvdata(&vdev->dev, vpmem);
+
+	vpmem->vdev = vdev;
+	err = init_vq(vpmem);
+	if (err)
+		goto out;
+
+	if (!virtio_has_feature(vdev, VIRTIO_PMEM_PLUG)) {
+		dev_err(&vdev->dev, "%s : pmem not supported\n",
+			__func__);
+		goto out;
+	}
+
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			start, &vpmem->start);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			size, &vpmem->size);
+
+	res_pfn.start = vpmem->start;
+	res_pfn.end   = vpmem->start + vpmem->size-1;
+
+	/* used for allocating memmap in the pmem device */
+	altmap	      = setup_pmem_pfn(vpmem, &res_pfn, &__altmap);
+
+	res = devm_request_mem_region(&vdev->dev,
+			res_pfn.start, resource_size(&res_pfn), "virtio-pmem");
+
+	if (!res) {
+		dev_warn(&vdev->dev, "could not reserve region ");
+		return -EBUSY;
+	}
+
+	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(&vdev->dev));
+
+	if (!q)
+		return -ENOMEM;
+
+	if (devm_add_action_or_reset(&vdev->dev,
+				virtio_pmem_release_queue, q))
+		return -ENOMEM;
+
+	vpmem->pfn_flags = PFN_DEV;
+
+	/* allocate memap in pmem device itself */
+	if (IS_ENABLED(CONFIG_ZONE_DEVICE)) {
+
+		addr = devm_memremap_pages(&vdev->dev, res,
+				&q->q_usage_counter, altmap);
+		vpmem->pfn_flags |= PFN_MAP;
+	} else
+		addr = devm_memremap(&vdev->dev, vpmem->start,
+				vpmem->size, ARCH_MEMREMAP_PMEM);
+
+        /*
+         * At release time the queue must be frozen before
+         * devm_memremap_pages is unwound
+         */
+	if (devm_add_action_or_reset(&vdev->dev,
+				virtio_pmem_freeze_queue, q))
+		return -ENOMEM;
+
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	vpmem->virt_addr = addr;
+	blk_queue_write_cache(q, 0, 0);
+	blk_queue_make_request(q, virtio_pmem_make_request);
+	blk_queue_physical_block_size(q, PAGE_SIZE);
+	blk_queue_logical_block_size(q, 512);
+	blk_queue_max_hw_sectors(q, UINT_MAX);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
+	q->queuedata = vpmem;
+
+	disk = alloc_disk_node(0, nid);
+	if (!disk)
+		return -ENOMEM;
+	vpmem->disk = disk;
+
+	disk->fops                = &pmem_fops;
+	disk->queue               = q;
+	disk->flags               = GENHD_FL_EXT_DEVT;
+	strcpy(disk->disk_name, "vpmem");
+	set_capacity(disk, vpmem->size/512);
+	gendev = disk_to_dev(disk);
+
+	virtio_device_ready(vdev);
+	device_add_disk(&vdev->dev, disk);
+
+	if (devm_add_action_or_reset(&vdev->dev,
+			virtio_pmem_release_disk, vpmem))
+		return -ENOMEM;
+
+	revalidate_disk(disk);
+	return 0;
+out:
+	vdev->config->del_vqs(vdev);
+	return err;
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+	.feature_table		= features,
+	.feature_table_size	= ARRAY_SIZE(features),
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtio_pmem_probe,
+	//.remove		= virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 000000000000..ec0c728c79ba
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,55 @@
+/*
+ * Virtio pmem Device
+ *
+ *
+ */
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/types.h>
+#include <linux/virtio_types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/pfn_t.h>
+#include <linux/fs.h>
+#include <linux/blk-mq.h>
+#include <linux/pmem_common.h>
+
+bool pmem_should_map_pages(struct device *dev);
+
+#define VIRTIO_PMEM_PLUG 0
+
+struct virtio_pmem_config {
+
+uint64_t start;
+uint64_t size;
+uint64_t align;
+uint64_t data_offset;
+};
+
+struct virtio_pmem {
+
+	struct virtio_device *vdev;
+	struct virtqueue *req_vq;
+
+	uint64_t start;
+	uint64_t size;
+	uint64_t align;
+	uint64_t data_offset;
+	u64 pfn_flags;
+	void *virt_addr;
+	struct gendisk *disk;
+} __packed;
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static unsigned int features[] = {
+	VIRTIO_PMEM_PLUG,
+};
+
+#endif
+
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC] QEMU: Add virtio pmem device
  2017-10-12 15:50 [RFC 0/2] KVM "fake DAX" device flushing Pankaj Gupta
  2017-10-12 15:50 ` [RFC 1/2] pmem: Move reusable code to base header files Pankaj Gupta
       [not found] ` <20171012155027.3277-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-12 15:50 ` Pankaj Gupta
  2 siblings, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 15:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric,
	pagupta

 This patch adds virtio-pmem Qemu device.

 This device configures memory address range information with file
 backend type. It acts like persistent memory device for KVM guest.
 It presents the memory address range to virtio-pmem driver over
 virtio channel and does the block flush whenever there is request
 from guest to flush/sync. (later part is yet to be implemented).

 Current code is a prototype to support guest with persistent 
 memory range & DAX.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 hw/virtio/Makefile.objs                     |   2 +-
 hw/virtio/virtio-pci.c                      |  44 ++++++++++
 hw/virtio/virtio-pci.h                      |  14 +++
 hw/virtio/virtio-pmem.c                     | 130 ++++++++++++++++++++++++++++
 include/hw/pci/pci.h                        |   1 +
 include/hw/virtio/virtio-pmem.h             |  42 +++++++++
 include/standard-headers/linux/virtio_ids.h |   1 +
 7 files changed, 233 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/virtio-pmem.c
 create mode 100644 include/hw/virtio/virtio-pmem.h

diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 765d363c1f..bb5573d2ef 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -5,7 +5,7 @@ common-obj-y += virtio-bus.o
 common-obj-y += virtio-mmio.o
 
 obj-y += virtio.o virtio-balloon.o 
-obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
+obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o virtio-pmem.o
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
 obj-y += virtio-crypto.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio-crypto-pci.o
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 8b0d6b69cd..eb56afb290 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2473,6 +2473,49 @@ static const TypeInfo virtio_rng_pci_info = {
     .class_init    = virtio_rng_pci_class_init,
 };
 
+/* virtio-pmem-pci */
+
+static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
+{
+    VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
+    DeviceState *vdev = DEVICE(&vpmem->vdev);
+
+    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
+    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
+}
+
+static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
+    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
+    k->realize = virtio_pmem_pci_realize;
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
+    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
+    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
+    pcidev_k->class_id = PCI_CLASS_OTHERS;
+}
+
+static void virtio_pmem_pci_instance_init(Object *obj)
+{
+    VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
+
+    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+                                TYPE_VIRTIO_PMEM);
+    object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
+                              &error_abort);
+}
+
+static const TypeInfo virtio_pmem_pci_info = {
+    .name          = TYPE_VIRTIO_PMEM_PCI,
+    .parent        = TYPE_VIRTIO_PCI,
+    .instance_size = sizeof(VirtIOPMEMPCI),
+    .instance_init = virtio_pmem_pci_instance_init,
+    .class_init    = virtio_pmem_pci_class_init,
+};
+
+
 /* virtio-input-pci */
 
 static Property virtio_input_pci_properties[] = {
@@ -2662,6 +2705,7 @@ static void virtio_pci_register_types(void)
     type_register_static(&virtio_balloon_pci_info);
     type_register_static(&virtio_serial_pci_info);
     type_register_static(&virtio_net_pci_info);
+    type_register_static(&virtio_pmem_pci_info);
 #ifdef CONFIG_VHOST_SCSI
     type_register_static(&vhost_scsi_pci_info);
 #endif
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index 69f5959623..37f0eeabd2 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -19,6 +19,7 @@
 #include "hw/virtio/virtio-blk.h"
 #include "hw/virtio/virtio-net.h"
 #include "hw/virtio/virtio-rng.h"
+#include "hw/virtio/virtio-pmem.h"
 #include "hw/virtio/virtio-serial.h"
 #include "hw/virtio/virtio-scsi.h"
 #include "hw/virtio/virtio-balloon.h"
@@ -53,6 +54,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
 typedef struct VirtIOGPUPCI VirtIOGPUPCI;
 typedef struct VHostVSockPCI VHostVSockPCI;
 typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
+typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
 
 /* virtio-pci-bus */
 
@@ -254,6 +256,18 @@ struct VirtIOBlkPCI {
 };
 
 /*
+ * virtio-pmem-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
+#define VIRTIO_PMEM_PCI(obj) \
+        OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
+
+struct VirtIOPMEMPCI {
+    VirtIOPCIProxy parent_obj;
+    VirtIOPMEM vdev;
+};
+
+/*
  * virtio-balloon-pci: This extends VirtioPCIProxy.
  */
 #define TYPE_VIRTIO_BALLOON_PCI "virtio-balloon-pci"
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
new file mode 100644
index 0000000000..6ba1fd1614
--- /dev/null
+++ b/hw/virtio/virtio-pmem.c
@@ -0,0 +1,130 @@
+/*
+ * Virtio pmem device
+ *
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "hw/virtio/virtio-pmem.h"
+
+
+static void virtio_pmem_system_reset(void *opaque)
+{
+
+}
+
+static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
+{
+    /* VirtIOPMEM  *vm = VIRTIO_PMEM(vdev); */
+}
+
+
+static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
+{
+    VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
+    struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *) config;
+
+    pmemcfg->start = pmem->start;
+    pmemcfg->size  = pmem->size;
+    pmemcfg->align = pmem->align;
+}
+
+static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t features,
+                                        Error **errp)
+{
+    virtio_add_feature(&features, VIRTIO_PMEM_PLUG);
+    return features;
+}
+
+
+
+static void virtio_pmem_realize(DeviceState *dev, Error **errp)
+{
+    VirtIODevice   *vdev   = VIRTIO_DEVICE(dev);
+    VirtIOPMEM     *pmem   = VIRTIO_PMEM(dev);
+    MachineState   *ms     = MACHINE(qdev_get_machine());
+    MemoryRegion   *mr;
+    PCMachineState *pcms   =
+        PC_MACHINE(object_dynamic_cast(OBJECT(ms), TYPE_PC_MACHINE));
+
+    if (!pmem->memdev) {
+        error_setg(errp, "virtio-pmem not set");
+        return;
+    }
+
+    pmem->start = pcms->hotplug_memory.base;
+
+    /*if (!pcmc->broken_reserved_end) {
+            pmem->size = memory_region_size(&pcms->hotplug_memory.mr);
+    }*/
+
+
+    mr               = host_memory_backend_get_memory(pmem->memdev, errp);
+    pmem->size       = memory_region_size(mr);
+    pmem->align      = memory_region_get_alignment(mr);
+
+    virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
+                sizeof(struct virtio_pmem_config));
+
+    pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
+
+    host_memory_backend_set_mapped(pmem->memdev, true);
+    qemu_register_reset(virtio_pmem_system_reset, pmem);
+}
+
+static void virtio_mem_check_memdev(Object *obj, const char *name, Object *val,
+                                    Error **errp)
+{
+
+    if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
+
+        char *path = object_get_canonical_path_component(val);
+        error_setg(errp, "Can't use already busy memdev: %s", path);
+        g_free(path);
+        return;
+    }
+
+    qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
+}
+
+static void virtio_pmem_instance_init(Object *obj)
+{
+
+    VirtIOPMEM *vm = VIRTIO_PMEM(obj);
+
+    object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
+                             (Object **)&vm->memdev,
+                             (void *) virtio_mem_check_memdev,
+                             OBJ_PROP_LINK_UNREF_ON_RELEASE,
+                             &error_abort);
+
+}
+
+
+static void virtio_pmem_class_init(ObjectClass *klass, void *data)
+{
+    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+
+    vdc->realize      =  virtio_pmem_realize;
+    vdc->get_config   =  virtio_pmem_get_config;
+    vdc->get_features =  virtio_pmem_get_features;
+}
+
+static TypeInfo virtio_pmem_info = {
+    .name          = TYPE_VIRTIO_PMEM,
+    .parent        = TYPE_VIRTIO_DEVICE,
+    .class_size    = sizeof(VirtIOPMEM),
+    .class_init    = virtio_pmem_class_init,
+    .instance_init = virtio_pmem_instance_init,
+};
+
+
+static void virtio_register_types(void)
+{
+    type_register_static(&virtio_pmem_info);
+}
+
+type_init(virtio_register_types)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 8bb6449dd7..b9bfb79a11 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -85,6 +85,7 @@ extern bool pci_available;
 #define PCI_DEVICE_ID_VIRTIO_RNG         0x1005
 #define PCI_DEVICE_ID_VIRTIO_9P          0x1009
 #define PCI_DEVICE_ID_VIRTIO_VSOCK       0x1012
+#define PCI_DEVICE_ID_VIRTIO_PMEM        0x1013
 
 #define PCI_VENDOR_ID_REDHAT             0x1b36
 #define PCI_DEVICE_ID_REDHAT_BRIDGE      0x0001
diff --git a/include/hw/virtio/virtio-pmem.h b/include/hw/virtio/virtio-pmem.h
new file mode 100644
index 0000000000..0d0642ed82
--- /dev/null
+++ b/include/hw/virtio/virtio-pmem.h
@@ -0,0 +1,42 @@
+/*
+ * Virtio pmem Device
+ *
+ *
+ */
+
+#ifndef QEMU_VIRTIO_PMEM_H
+#define QEMU_VIRTIO_PMEM_H
+
+#include "hw/virtio/virtio.h"
+#include "exec/memory.h"
+#include "sysemu/hostmem.h"
+#include "standard-headers/linux/virtio_ids.h"
+#include "hw/boards.h"
+#include "hw/i386/pc.h"
+
+#define VIRTIO_PMEM_PLUG 0
+
+#define TYPE_VIRTIO_PMEM "virtio-pmem"
+
+#define VIRTIO_PMEM(obj) \
+        OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
+
+typedef struct VirtIOPMEM {
+
+    VirtIODevice parent_obj;
+    uint64_t start;
+    uint64_t size;
+    uint64_t align;
+
+    VirtQueue *rq_vq;
+    MemoryRegion *mr;
+    HostMemoryBackend *memdev;
+} VirtIOPMEM;
+
+struct virtio_pmem_config {
+
+    uint64_t start;
+    uint64_t size;
+    uint64_t align;
+};
+#endif
diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
index 6d5c3b2d4f..5ebd04980d 100644
--- a/include/standard-headers/linux/virtio_ids.h
+++ b/include/standard-headers/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         21 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC 1/2] pmem: Move reusable code to base header files
  2017-10-12 15:50 ` [RFC 1/2] pmem: Move reusable code to base header files Pankaj Gupta
@ 2017-10-12 20:42   ` Dan Williams
  2017-10-12 21:27     ` [Qemu-devel] " Pankaj Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2017-10-12 20:42 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-kernel, KVM list, Qemu Developers, linux-nvdimm, Linux MM,
	Jan Kara, Stefan Hajnoczi, Rik van Riel, Haozhong Zhang,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Zwisler, Ross,
	David Hildenbrand, Xiao Guangrong

On Thu, Oct 12, 2017 at 8:50 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>  This patch moves common code to base header files
>  so that it can be used for both ACPI pmem and VIRTIO pmem
>  drivers. More common code needs to be moved out in future
>  based on functionality required for virtio_pmem driver and
>  coupling of code with existing ACPI pmem driver.
>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
[..]
> diff --git a/include/linux/pmem_common.h b/include/linux/pmem_common.h
> new file mode 100644
> index 000000000000..e2e718c74b3f
> --- /dev/null
> +++ b/include/linux/pmem_common.h

This should be a common C file, not a header.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 15:50   ` [RFC 2/2] KVM: add virtio-pmem driver Pankaj Gupta
@ 2017-10-12 20:51     ` Dan Williams
  2017-10-12 21:25       ` Pankaj Gupta
  2017-10-13  9:44     ` Stefan Hajnoczi
       [not found]     ` <20171012155027.3277-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2017-10-12 20:51 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-kernel, KVM list, Qemu Developers, linux-nvdimm, Linux MM,
	Jan Kara, Stefan Hajnoczi, Rik van Riel, Haozhong Zhang,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Zwisler, Ross,
	David Hildenbrand, Xiao Guangrong

On Thu, Oct 12, 2017 at 8:50 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
>   This patch adds virtio-pmem driver for KVM guest.
>   Guest reads the persistent memory range information
>   over virtio bus from Qemu and reserves the range
>   as persistent memory. Guest also allocates a block
>   device corresponding to the pmem range which later
>   can be accessed with DAX compatible file systems.
>   Idea is to use the virtio channel between guest and
>   host to perform the block device flush for guest pmem
>   DAX device.
>
>   There is work to do including DAX file system support
>   and other advanced features.
>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/virtio/Kconfig           |  10 ++
>  drivers/virtio/Makefile          |   1 +
>  drivers/virtio/virtio_pmem.c     | 322 +++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_pmem.h |  55 +++++++
>  4 files changed, 388 insertions(+)
>  create mode 100644 drivers/virtio/virtio_pmem.c
>  create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index cff773f15b7e..0192c4bda54b 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
>
>           If unsure, say Y.
>
> +config VIRTIO_PMEM
> +       tristate "Virtio pmem driver"
> +       depends on VIRTIO
> +       ---help---
> +        This driver adds persistent memory range within a KVM guest.

I think we need to call this something other than persistent memory to
make it clear that this not memory where the persistence can be
managed from userspace. The persistence point always requires a driver
call, so this is something distinctly different than "persistent
memory". For example, it's a bug if this memory range ends up backing
a device-dax range in the guest where there is no such thing as a
driver callback to perform the flushing. How does this solution
protect against that scenario?

> +         It also associates a block device corresponding to the pmem
> +        range.
> +
> +        If unsure, say M.
> +
>  config VIRTIO_BALLOON
>         tristate "Virtio balloon driver"
>         depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 41e30e3dc842..032ade725cc2 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -5,3 +5,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 000000000000..74e47cae0e24
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,322 @@
> +/*
> + * virtio-pmem driver
> + */
> +
> +#include <linux/virtio.h>
> +#include <linux/swap.h>
> +#include <linux/workqueue.h>
> +#include <linux/delay.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/oom.h>
> +#include <linux/wait.h>
> +#include <linux/mm.h>
> +#include <linux/mount.h>
> +#include <linux/magic.h>
> +#include <linux/virtio_pmem.h>
> +
> +void devm_vpmem_disable(struct device *dev, struct resource *res, void *addr)
> +{
> +       devm_memunmap(dev, addr);
> +       devm_release_mem_region(dev, res->start, resource_size(res));
> +}
> +
> +static void pmem_flush_done(struct virtqueue *vq)
> +{
> +       return;
> +};
> +
> +static void virtio_pmem_release_queue(void *q)
> +{
> +       blk_cleanup_queue(q);
> +}
> +
> +static void virtio_pmem_freeze_queue(void *q)
> +{
> +       blk_freeze_queue_start(q);
> +}
> +
> +static void virtio_pmem_release_disk(void *__pmem)
> +{
> +       struct virtio_pmem *pmem = __pmem;
> +
> +       del_gendisk(pmem->disk);
> +       put_disk(pmem->disk);
> +}

This code seems identical to the base pmem case, it should move to the
shared code object.

> +
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> +       struct virtqueue *vq;
> +
> +       /* single vq */
> +       vq = virtio_find_single_vq(vpmem->vdev, pmem_flush_done, "flush_queue");
> +
> +       if (IS_ERR(vq))
> +               return PTR_ERR(vq);
> +
> +       return 0;
> +}
> +
> +static struct vmem_altmap *setup_pmem_pfn(struct virtio_pmem *vpmem,
> +                       struct resource *res, struct vmem_altmap *altmap)
> +{
> +       u32 start_pad = 0, end_trunc = 0;
> +       resource_size_t start, size;
> +       unsigned long npfns;
> +       phys_addr_t offset;
> +
> +       size = resource_size(res);
> +       start = PHYS_SECTION_ALIGN_DOWN(res->start);
> +
> +       if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> +               IORES_DESC_NONE) == REGION_MIXED) {
> +
> +               start = res->start;
> +               start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
> +       }
> +       start = res->start;
> +       size = PHYS_SECTION_ALIGN_UP(start + size) - start;
> +
> +       if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> +               IORES_DESC_NONE) == REGION_MIXED) {
> +
> +               size = resource_size(res);
> +               end_trunc = start + size -
> +                               PHYS_SECTION_ALIGN_DOWN(start + size);
> +       }
> +
> +       start += start_pad;
> +       size = resource_size(res);
> +       npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - SZ_8K)
> +                                               / PAGE_SIZE);
> +
> +      /*
> +       * vmemmap_populate_hugepages() allocates the memmap array in
> +       * HPAGE_SIZE chunks.
> +       */
> +       offset = ALIGN(start + SZ_8K + 64 * npfns, HPAGE_SIZE) - start;
> +       vpmem->data_offset = offset;
> +
> +       struct vmem_altmap __altmap = {
> +               .base_pfn = init_altmap_base(start+start_pad),
> +               .reserve = init_altmap_reserve(start+start_pad),
> +       };
> +
> +       res->start += start_pad;
> +       res->end -= end_trunc;
> +       memcpy(altmap, &__altmap, sizeof(*altmap));
> +       altmap->free = PHYS_PFN(offset - SZ_8K);
> +       altmap->alloc = 0;
> +
> +       return altmap;
> +}
> +
> +static blk_status_t pmem_do_bvec(struct virtio_pmem *pmem, struct page *page,
> +                       unsigned int len, unsigned int off, bool is_write,
> +                       sector_t sector)
> +{
> +       blk_status_t rc = BLK_STS_OK;
> +       phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> +       void *pmem_addr = pmem->virt_addr + pmem_off;
> +
> +       if (!is_write) {
> +               rc = read_pmem(page, off, pmem_addr, len);
> +                       flush_dcache_page(page);
> +       } else {
> +               flush_dcache_page(page);
> +               write_pmem(pmem_addr, page, off, len);
> +       }
> +
> +       return rc;
> +}
> +
> +static int vpmem_rw_page(struct block_device *bdev, sector_t sector,
> +                      struct page *page, bool is_write)
> +{
> +       struct virtio_pmem  *pmem = bdev->bd_queue->queuedata;
> +       blk_status_t rc;
> +
> +       rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
> +                         0, is_write, sector);
> +
> +       if (rc == 0)
> +               page_endio(page, is_write, 0);
> +
> +       return blk_status_to_errno(rc);
> +}
> +
> +#ifndef REQ_FLUSH
> +#define REQ_FLUSH REQ_PREFLUSH
> +#endif
> +
> +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
> +                       struct bio *bio)
> +{
> +       blk_status_t rc = 0;
> +       struct bio_vec bvec;
> +       struct bvec_iter iter;
> +       struct virtio_pmem *pmem = q->queuedata;
> +
> +       if (bio->bi_opf & REQ_FLUSH)
> +               //todo host flush command
> +
> +       bio_for_each_segment(bvec, bio, iter) {
> +               rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> +                               bvec.bv_offset, op_is_write(bio_op(bio)),
> +                               iter.bi_sector);
> +               if (rc) {
> +                       bio->bi_status = rc;
> +                       break;
> +               }
> +       }
> +
> +       bio_endio(bio);
> +       return BLK_QC_T_NONE;
> +}

Again, the above could be shared by both drivers.

> +
> +static const struct block_device_operations pmem_fops = {
> +       .owner =                THIS_MODULE,
> +       .rw_page =              vpmem_rw_page,
> +       //.revalidate_disk =    nvdimm_revalidate_disk,
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +       struct virtio_pmem *vpmem;
> +       int err = 0;
> +       void *addr;
> +       struct resource *res, res_pfn;
> +       struct request_queue *q;
> +       struct vmem_altmap __altmap, *altmap = NULL;
> +       struct gendisk *disk;
> +       struct device *gendev;
> +       int nid = dev_to_node(&vdev->dev);
> +
> +       if (!vdev->config->get) {
> +               dev_err(&vdev->dev, "%s failure: config disabled\n",
> +                       __func__);
> +               return -EINVAL;
> +       }
> +
> +       vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +                       GFP_KERNEL);
> +
> +       if (!vpmem) {
> +               err = -ENOMEM;
> +               goto out;
> +       }
> +
> +       dev_set_drvdata(&vdev->dev, vpmem);
> +
> +       vpmem->vdev = vdev;
> +       err = init_vq(vpmem);
> +       if (err)
> +               goto out;
> +
> +       if (!virtio_has_feature(vdev, VIRTIO_PMEM_PLUG)) {
> +               dev_err(&vdev->dev, "%s : pmem not supported\n",
> +                       __func__);
> +               goto out;
> +       }
> +
> +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +                       start, &vpmem->start);
> +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +                       size, &vpmem->size);
> +
> +       res_pfn.start = vpmem->start;
> +       res_pfn.end   = vpmem->start + vpmem->size-1;
> +
> +       /* used for allocating memmap in the pmem device */
> +       altmap        = setup_pmem_pfn(vpmem, &res_pfn, &__altmap);
> +
> +       res = devm_request_mem_region(&vdev->dev,
> +                       res_pfn.start, resource_size(&res_pfn), "virtio-pmem");
> +
> +       if (!res) {
> +               dev_warn(&vdev->dev, "could not reserve region ");
> +               return -EBUSY;
> +       }
> +
> +       q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(&vdev->dev));
> +
> +       if (!q)
> +               return -ENOMEM;
> +
> +       if (devm_add_action_or_reset(&vdev->dev,
> +                               virtio_pmem_release_queue, q))
> +               return -ENOMEM;
> +
> +       vpmem->pfn_flags = PFN_DEV;
> +
> +       /* allocate memap in pmem device itself */
> +       if (IS_ENABLED(CONFIG_ZONE_DEVICE)) {
> +
> +               addr = devm_memremap_pages(&vdev->dev, res,
> +                               &q->q_usage_counter, altmap);
> +               vpmem->pfn_flags |= PFN_MAP;
> +       } else
> +               addr = devm_memremap(&vdev->dev, vpmem->start,
> +                               vpmem->size, ARCH_MEMREMAP_PMEM);
> +
> +        /*
> +         * At release time the queue must be frozen before
> +         * devm_memremap_pages is unwound
> +         */
> +       if (devm_add_action_or_reset(&vdev->dev,
> +                               virtio_pmem_freeze_queue, q))
> +               return -ENOMEM;
> +
> +       if (IS_ERR(addr))
> +               return PTR_ERR(addr);
> +
> +       vpmem->virt_addr = addr;
> +       blk_queue_write_cache(q, 0, 0);
> +       blk_queue_make_request(q, virtio_pmem_make_request);
> +       blk_queue_physical_block_size(q, PAGE_SIZE);
> +       blk_queue_logical_block_size(q, 512);
> +       blk_queue_max_hw_sectors(q, UINT_MAX);
> +       queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
> +       queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
> +       q->queuedata = vpmem;
> +
> +       disk = alloc_disk_node(0, nid);
> +       if (!disk)
> +               return -ENOMEM;
> +       vpmem->disk = disk;
> +
> +       disk->fops                = &pmem_fops;
> +       disk->queue               = q;
> +       disk->flags               = GENHD_FL_EXT_DEVT;
> +       strcpy(disk->disk_name, "vpmem");
> +       set_capacity(disk, vpmem->size/512);
> +       gendev = disk_to_dev(disk);
> +
> +       virtio_device_ready(vdev);
> +       device_add_disk(&vdev->dev, disk);
> +
> +       if (devm_add_action_or_reset(&vdev->dev,
> +                       virtio_pmem_release_disk, vpmem))
> +               return -ENOMEM;
> +
> +       revalidate_disk(disk);
> +       return 0;
> +out:
> +       vdev->config->del_vqs(vdev);
> +       return err;
> +}

Here we have a mix of code that is common and some that is virtio
specific, the shared code should be factored out into a common helper
that both drivers call.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 20:51     ` Dan Williams
@ 2017-10-12 21:25       ` Pankaj Gupta
  2017-10-12 21:54         ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 21:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, KVM list, Qemu Developers, linux-nvdimm, Linux MM,
	Jan Kara, Stefan Hajnoczi, Rik van Riel, Haozhong Zhang,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Ross Zwisler,
	David Hildenbrand, Xiao Guangrong


> >   This patch adds virtio-pmem driver for KVM guest.
> >   Guest reads the persistent memory range information
> >   over virtio bus from Qemu and reserves the range
> >   as persistent memory. Guest also allocates a block
> >   device corresponding to the pmem range which later
> >   can be accessed with DAX compatible file systems.
> >   Idea is to use the virtio channel between guest and
> >   host to perform the block device flush for guest pmem
> >   DAX device.
> >
> >   There is work to do including DAX file system support
> >   and other advanced features.
> >
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/virtio/Kconfig           |  10 ++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/virtio_pmem.c     | 322
> >  +++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_pmem.h |  55 +++++++
> >  4 files changed, 388 insertions(+)
> >  create mode 100644 drivers/virtio/virtio_pmem.c
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index cff773f15b7e..0192c4bda54b 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
> >
> >           If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > +       tristate "Virtio pmem driver"
> > +       depends on VIRTIO
> > +       ---help---
> > +        This driver adds persistent memory range within a KVM guest.
> 
> I think we need to call this something other than persistent memory to
> make it clear that this not memory where the persistence can be
> managed from userspace. The persistence point always requires a driver
> call, so this is something distinctly different than "persistent
> memory". For example, it's a bug if this memory range ends up backing
> a device-dax range in the guest where there is no such thing as a
> driver callback to perform the flushing. How does this solution
> protect against that scenario?

yes, you are right we are not providing device_dax in this case so it should
be clear from name. Any suggestion for name?  

> 
> > +         It also associates a block device corresponding to the pmem
> > +        range.
> > +
> > +        If unsure, say M.
> > +
> >  config VIRTIO_BALLOON
> >         tristate "Virtio balloon driver"
> >         depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 41e30e3dc842..032ade725cc2 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -5,3 +5,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..74e47cae0e24
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,322 @@
> > +/*
> > + * virtio-pmem driver
> > + */
> > +
> > +#include <linux/virtio.h>
> > +#include <linux/swap.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/delay.h>
> > +#include <linux/slab.h>
> > +#include <linux/module.h>
> > +#include <linux/oom.h>
> > +#include <linux/wait.h>
> > +#include <linux/mm.h>
> > +#include <linux/mount.h>
> > +#include <linux/magic.h>
> > +#include <linux/virtio_pmem.h>
> > +
> > +void devm_vpmem_disable(struct device *dev, struct resource *res, void
> > *addr)
> > +{
> > +       devm_memunmap(dev, addr);
> > +       devm_release_mem_region(dev, res->start, resource_size(res));
> > +}
> > +
> > +static void pmem_flush_done(struct virtqueue *vq)
> > +{
> > +       return;
> > +};
> > +
> > +static void virtio_pmem_release_queue(void *q)
> > +{
> > +       blk_cleanup_queue(q);
> > +}
> > +
> > +static void virtio_pmem_freeze_queue(void *q)
> > +{
> > +       blk_freeze_queue_start(q);
> > +}
> > +
> > +static void virtio_pmem_release_disk(void *__pmem)
> > +{
> > +       struct virtio_pmem *pmem = __pmem;
> > +
> > +       del_gendisk(pmem->disk);
> > +       put_disk(pmem->disk);
> > +}
> 
> This code seems identical to the base pmem case, it should move to the
> shared code object.

Sure!
> 
> > +
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > +       struct virtqueue *vq;
> > +
> > +       /* single vq */
> > +       vq = virtio_find_single_vq(vpmem->vdev, pmem_flush_done,
> > "flush_queue");
> > +
> > +       if (IS_ERR(vq))
> > +               return PTR_ERR(vq);
> > +
> > +       return 0;
> > +}
> > +
> > +static struct vmem_altmap *setup_pmem_pfn(struct virtio_pmem *vpmem,
> > +                       struct resource *res, struct vmem_altmap *altmap)
> > +{
> > +       u32 start_pad = 0, end_trunc = 0;
> > +       resource_size_t start, size;
> > +       unsigned long npfns;
> > +       phys_addr_t offset;
> > +
> > +       size = resource_size(res);
> > +       start = PHYS_SECTION_ALIGN_DOWN(res->start);
> > +
> > +       if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> > +               IORES_DESC_NONE) == REGION_MIXED) {
> > +
> > +               start = res->start;
> > +               start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
> > +       }
> > +       start = res->start;
> > +       size = PHYS_SECTION_ALIGN_UP(start + size) - start;
> > +
> > +       if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> > +               IORES_DESC_NONE) == REGION_MIXED) {
> > +
> > +               size = resource_size(res);
> > +               end_trunc = start + size -
> > +                               PHYS_SECTION_ALIGN_DOWN(start + size);
> > +       }
> > +
> > +       start += start_pad;
> > +       size = resource_size(res);
> > +       npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - SZ_8K)
> > +                                               / PAGE_SIZE);
> > +
> > +      /*
> > +       * vmemmap_populate_hugepages() allocates the memmap array in
> > +       * HPAGE_SIZE chunks.
> > +       */
> > +       offset = ALIGN(start + SZ_8K + 64 * npfns, HPAGE_SIZE) - start;
> > +       vpmem->data_offset = offset;
> > +
> > +       struct vmem_altmap __altmap = {
> > +               .base_pfn = init_altmap_base(start+start_pad),
> > +               .reserve = init_altmap_reserve(start+start_pad),
> > +       };
> > +
> > +       res->start += start_pad;
> > +       res->end -= end_trunc;
> > +       memcpy(altmap, &__altmap, sizeof(*altmap));
> > +       altmap->free = PHYS_PFN(offset - SZ_8K);
> > +       altmap->alloc = 0;
> > +
> > +       return altmap;
> > +}
> > +
> > +static blk_status_t pmem_do_bvec(struct virtio_pmem *pmem, struct page
> > *page,
> > +                       unsigned int len, unsigned int off, bool is_write,
> > +                       sector_t sector)
> > +{
> > +       blk_status_t rc = BLK_STS_OK;
> > +       phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> > +       void *pmem_addr = pmem->virt_addr + pmem_off;
> > +
> > +       if (!is_write) {
> > +               rc = read_pmem(page, off, pmem_addr, len);
> > +                       flush_dcache_page(page);
> > +       } else {
> > +               flush_dcache_page(page);
> > +               write_pmem(pmem_addr, page, off, len);
> > +       }
> > +
> > +       return rc;
> > +}
> > +
> > +static int vpmem_rw_page(struct block_device *bdev, sector_t sector,
> > +                      struct page *page, bool is_write)
> > +{
> > +       struct virtio_pmem  *pmem = bdev->bd_queue->queuedata;
> > +       blk_status_t rc;
> > +
> > +       rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
> > +                         0, is_write, sector);
> > +
> > +       if (rc == 0)
> > +               page_endio(page, is_write, 0);
> > +
> > +       return blk_status_to_errno(rc);
> > +}
> > +
> > +#ifndef REQ_FLUSH
> > +#define REQ_FLUSH REQ_PREFLUSH
> > +#endif
> > +
> > +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
> > +                       struct bio *bio)
> > +{
> > +       blk_status_t rc = 0;
> > +       struct bio_vec bvec;
> > +       struct bvec_iter iter;
> > +       struct virtio_pmem *pmem = q->queuedata;
> > +
> > +       if (bio->bi_opf & REQ_FLUSH)
> > +               //todo host flush command
> > +
> > +       bio_for_each_segment(bvec, bio, iter) {
> > +               rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> > +                               bvec.bv_offset, op_is_write(bio_op(bio)),
> > +                               iter.bi_sector);
> > +               if (rc) {
> > +                       bio->bi_status = rc;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       bio_endio(bio);
> > +       return BLK_QC_T_NONE;
> > +}
> 
> Again, the above could be shared by both drivers.

yes, I will do that.
> 
> > +
> > +static const struct block_device_operations pmem_fops = {
> > +       .owner =                THIS_MODULE,
> > +       .rw_page =              vpmem_rw_page,
> > +       //.revalidate_disk =    nvdimm_revalidate_disk,
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +       struct virtio_pmem *vpmem;
> > +       int err = 0;
> > +       void *addr;
> > +       struct resource *res, res_pfn;
> > +       struct request_queue *q;
> > +       struct vmem_altmap __altmap, *altmap = NULL;
> > +       struct gendisk *disk;
> > +       struct device *gendev;
> > +       int nid = dev_to_node(&vdev->dev);
> > +
> > +       if (!vdev->config->get) {
> > +               dev_err(&vdev->dev, "%s failure: config disabled\n",
> > +                       __func__);
> > +               return -EINVAL;
> > +       }
> > +
> > +       vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +                       GFP_KERNEL);
> > +
> > +       if (!vpmem) {
> > +               err = -ENOMEM;
> > +               goto out;
> > +       }
> > +
> > +       dev_set_drvdata(&vdev->dev, vpmem);
> > +
> > +       vpmem->vdev = vdev;
> > +       err = init_vq(vpmem);
> > +       if (err)
> > +               goto out;
> > +
> > +       if (!virtio_has_feature(vdev, VIRTIO_PMEM_PLUG)) {
> > +               dev_err(&vdev->dev, "%s : pmem not supported\n",
> > +                       __func__);
> > +               goto out;
> > +       }
> > +
> > +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +                       start, &vpmem->start);
> > +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +                       size, &vpmem->size);
> > +
> > +       res_pfn.start = vpmem->start;
> > +       res_pfn.end   = vpmem->start + vpmem->size-1;
> > +
> > +       /* used for allocating memmap in the pmem device */
> > +       altmap        = setup_pmem_pfn(vpmem, &res_pfn, &__altmap);
> > +
> > +       res = devm_request_mem_region(&vdev->dev,
> > +                       res_pfn.start, resource_size(&res_pfn),
> > "virtio-pmem");
> > +
> > +       if (!res) {
> > +               dev_warn(&vdev->dev, "could not reserve region ");
> > +               return -EBUSY;
> > +       }
> > +
> > +       q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(&vdev->dev));
> > +
> > +       if (!q)
> > +               return -ENOMEM;
> > +
> > +       if (devm_add_action_or_reset(&vdev->dev,
> > +                               virtio_pmem_release_queue, q))
> > +               return -ENOMEM;
> > +
> > +       vpmem->pfn_flags = PFN_DEV;
> > +
> > +       /* allocate memap in pmem device itself */
> > +       if (IS_ENABLED(CONFIG_ZONE_DEVICE)) {
> > +
> > +               addr = devm_memremap_pages(&vdev->dev, res,
> > +                               &q->q_usage_counter, altmap);
> > +               vpmem->pfn_flags |= PFN_MAP;
> > +       } else
> > +               addr = devm_memremap(&vdev->dev, vpmem->start,
> > +                               vpmem->size, ARCH_MEMREMAP_PMEM);
> > +
> > +        /*
> > +         * At release time the queue must be frozen before
> > +         * devm_memremap_pages is unwound
> > +         */
> > +       if (devm_add_action_or_reset(&vdev->dev,
> > +                               virtio_pmem_freeze_queue, q))
> > +               return -ENOMEM;
> > +
> > +       if (IS_ERR(addr))
> > +               return PTR_ERR(addr);
> > +
> > +       vpmem->virt_addr = addr;
> > +       blk_queue_write_cache(q, 0, 0);
> > +       blk_queue_make_request(q, virtio_pmem_make_request);
> > +       blk_queue_physical_block_size(q, PAGE_SIZE);
> > +       blk_queue_logical_block_size(q, 512);
> > +       blk_queue_max_hw_sectors(q, UINT_MAX);
> > +       queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
> > +       queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
> > +       q->queuedata = vpmem;
> > +
> > +       disk = alloc_disk_node(0, nid);
> > +       if (!disk)
> > +               return -ENOMEM;
> > +       vpmem->disk = disk;
> > +
> > +       disk->fops                = &pmem_fops;
> > +       disk->queue               = q;
> > +       disk->flags               = GENHD_FL_EXT_DEVT;
> > +       strcpy(disk->disk_name, "vpmem");
> > +       set_capacity(disk, vpmem->size/512);
> > +       gendev = disk_to_dev(disk);
> > +
> > +       virtio_device_ready(vdev);
> > +       device_add_disk(&vdev->dev, disk);
> > +
> > +       if (devm_add_action_or_reset(&vdev->dev,
> > +                       virtio_pmem_release_disk, vpmem))
> > +               return -ENOMEM;
> > +
> > +       revalidate_disk(disk);
> > +       return 0;
> > +out:
> > +       vdev->config->del_vqs(vdev);
> > +       return err;
> > +}
> 
> Here we have a mix of code that is common and some that is virtio
> specific, the shared code should be factored out into a common helper
> that both drivers call.

yes, i will factor this as well.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pmem: Move reusable code to base header files
  2017-10-12 20:42   ` Dan Williams
@ 2017-10-12 21:27     ` Pankaj Gupta
  0 siblings, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 21:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Haozhong Zhang, Jan Kara, Xiao Guangrong, KVM list,
	David Hildenbrand, linux-nvdimm, Ross Zwisler, linux-kernel,
	Qemu Developers, Linux MM, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal


> 
> On Thu, Oct 12, 2017 at 8:50 AM, Pankaj Gupta <pagupta@redhat.com> wrote:
> >  This patch moves common code to base header files
> >  so that it can be used for both ACPI pmem and VIRTIO pmem
> >  drivers. More common code needs to be moved out in future
> >  based on functionality required for virtio_pmem driver and
> >  coupling of code with existing ACPI pmem driver.
> >
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> [..]
> > diff --git a/include/linux/pmem_common.h b/include/linux/pmem_common.h
> > new file mode 100644
> > index 000000000000..e2e718c74b3f
> > --- /dev/null
> > +++ b/include/linux/pmem_common.h
> 
> This should be a common C file, not a header.

Sure! will create a common C file to put all the common code there.

> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 21:25       ` Pankaj Gupta
@ 2017-10-12 21:54         ` Dan Williams
       [not found]           ` <CAPcyv4gkri7t+3Unf0sc9AHMnz-v9G_qV_bJppLjUUNAn7drrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2017-10-12 21:54 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-kernel, KVM list, Qemu Developers, linux-nvdimm, Linux MM,
	Jan Kara, Stefan Hajnoczi, Rik van Riel, Haozhong Zhang,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Ross Zwisler,
	David Hildenbrand, Xiao Guangrong

On Thu, Oct 12, 2017 at 2:25 PM, Pankaj Gupta <pagupta@redhat.com> wrote:
>
>> >   This patch adds virtio-pmem driver for KVM guest.
>> >   Guest reads the persistent memory range information
>> >   over virtio bus from Qemu and reserves the range
>> >   as persistent memory. Guest also allocates a block
>> >   device corresponding to the pmem range which later
>> >   can be accessed with DAX compatible file systems.
>> >   Idea is to use the virtio channel between guest and
>> >   host to perform the block device flush for guest pmem
>> >   DAX device.
>> >
>> >   There is work to do including DAX file system support
>> >   and other advanced features.
>> >
>> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
>> > ---
>> >  drivers/virtio/Kconfig           |  10 ++
>> >  drivers/virtio/Makefile          |   1 +
>> >  drivers/virtio/virtio_pmem.c     | 322
>> >  +++++++++++++++++++++++++++++++++++++++
>> >  include/uapi/linux/virtio_pmem.h |  55 +++++++
>> >  4 files changed, 388 insertions(+)
>> >  create mode 100644 drivers/virtio/virtio_pmem.c
>> >  create mode 100644 include/uapi/linux/virtio_pmem.h
>> >
>> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> > index cff773f15b7e..0192c4bda54b 100644
>> > --- a/drivers/virtio/Kconfig
>> > +++ b/drivers/virtio/Kconfig
>> > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
>> >
>> >           If unsure, say Y.
>> >
>> > +config VIRTIO_PMEM
>> > +       tristate "Virtio pmem driver"
>> > +       depends on VIRTIO
>> > +       ---help---
>> > +        This driver adds persistent memory range within a KVM guest.
>>
>> I think we need to call this something other than persistent memory to
>> make it clear that this not memory where the persistence can be
>> managed from userspace. The persistence point always requires a driver
>> call, so this is something distinctly different than "persistent
>> memory". For example, it's a bug if this memory range ends up backing
>> a device-dax range in the guest where there is no such thing as a
>> driver callback to perform the flushing. How does this solution
>> protect against that scenario?
>
> yes, you are right we are not providing device_dax in this case so it should
> be clear from name. Any suggestion for name?

So currently /proc/iomem in a guest with a pmem device attached to a
namespace looks like this:

    c00000000-13bfffffff : Persistent Memory
       c00000000-13bfffffff : namespace2.0

Can we call it "Virtio Shared Memory" to make it clear it is a
different beast than typical "Persistent Memory"?  You can likely
inject your own name into the resource tree the same way we do in the
NFIT driver. See acpi_nfit_insert_resource().

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
       [not found]           ` <CAPcyv4gkri7t+3Unf0sc9AHMnz-v9G_qV_bJppLjUUNAn7drrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-12 22:18             ` Pankaj Gupta
  2017-10-12 22:27               ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 22:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Rik van Riel, Jan Kara, Xiao Guangrong, KVM list,
	David Hildenbrand, linux-nvdimm, Ross Zwisler,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Qemu Developers, Linux MM,
	Stefan Hajnoczi, Paolo Bonzini, Nitesh Narayan Lal


> 
> On Thu, Oct 12, 2017 at 2:25 PM, Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> >> >   This patch adds virtio-pmem driver for KVM guest.
> >> >   Guest reads the persistent memory range information
> >> >   over virtio bus from Qemu and reserves the range
> >> >   as persistent memory. Guest also allocates a block
> >> >   device corresponding to the pmem range which later
> >> >   can be accessed with DAX compatible file systems.
> >> >   Idea is to use the virtio channel between guest and
> >> >   host to perform the block device flush for guest pmem
> >> >   DAX device.
> >> >
> >> >   There is work to do including DAX file system support
> >> >   and other advanced features.
> >> >
> >> > Signed-off-by: Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> > ---
> >> >  drivers/virtio/Kconfig           |  10 ++
> >> >  drivers/virtio/Makefile          |   1 +
> >> >  drivers/virtio/virtio_pmem.c     | 322
> >> >  +++++++++++++++++++++++++++++++++++++++
> >> >  include/uapi/linux/virtio_pmem.h |  55 +++++++
> >> >  4 files changed, 388 insertions(+)
> >> >  create mode 100644 drivers/virtio/virtio_pmem.c
> >> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> >> >
> >> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> >> > index cff773f15b7e..0192c4bda54b 100644
> >> > --- a/drivers/virtio/Kconfig
> >> > +++ b/drivers/virtio/Kconfig
> >> > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
> >> >
> >> >           If unsure, say Y.
> >> >
> >> > +config VIRTIO_PMEM
> >> > +       tristate "Virtio pmem driver"
> >> > +       depends on VIRTIO
> >> > +       ---help---
> >> > +        This driver adds persistent memory range within a KVM guest.
> >>
> >> I think we need to call this something other than persistent memory to
> >> make it clear that this not memory where the persistence can be
> >> managed from userspace. The persistence point always requires a driver
> >> call, so this is something distinctly different than "persistent
> >> memory". For example, it's a bug if this memory range ends up backing
> >> a device-dax range in the guest where there is no such thing as a
> >> driver callback to perform the flushing. How does this solution
> >> protect against that scenario?
> >
> > yes, you are right we are not providing device_dax in this case so it
> > should
> > be clear from name. Any suggestion for name?
> 
> So currently /proc/iomem in a guest with a pmem device attached to a
> namespace looks like this:
> 
>     c00000000-13bfffffff : Persistent Memory
>        c00000000-13bfffffff : namespace2.0
> 
> Can we call it "Virtio Shared Memory" to make it clear it is a
> different beast than typical "Persistent Memory"?  You can likely

I think somewhere we need persistent keyword 'Virtio Persistent Memory' or 
so.

> inject your own name into the resource tree the same way we do in the
> NFIT driver. See acpi_nfit_insert_resource().

Sure! thank you.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 22:18             ` Pankaj Gupta
@ 2017-10-12 22:27               ` Rik van Riel
  2017-10-12 22:39                 ` Pankaj Gupta
  2017-10-12 22:52                 ` Pankaj Gupta
  0 siblings, 2 replies; 34+ messages in thread
From: Rik van Riel @ 2017-10-12 22:27 UTC (permalink / raw)
  To: Pankaj Gupta, Dan Williams
  Cc: linux-kernel, KVM list, Qemu Developers, linux-nvdimm, Linux MM,
	Jan Kara, Stefan Hajnoczi, Haozhong Zhang, Nitesh Narayan Lal,
	Kevin Wolf, Paolo Bonzini, Ross Zwisler, David Hildenbrand,
	Xiao Guangrong

On Thu, 2017-10-12 at 18:18 -0400, Pankaj Gupta wrote:
> > 
> > On Thu, Oct 12, 2017 at 2:25 PM, Pankaj Gupta <pagupta@redhat.com>
> > wrote:
> > > 
> > > > >   This patch adds virtio-pmem driver for KVM guest.
> > > > >   Guest reads the persistent memory range information
> > > > >   over virtio bus from Qemu and reserves the range
> > > > >   as persistent memory. Guest also allocates a block
> > > > >   device corresponding to the pmem range which later
> > > > >   can be accessed with DAX compatible file systems.
> > > > >   Idea is to use the virtio channel between guest and
> > > > >   host to perform the block device flush for guest pmem
> > > > >   DAX device.
> > > > > 
> > > > >   There is work to do including DAX file system support
> > > > >   and other advanced features.
> > > > > 
> > > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > > > ---
> > > > >  drivers/virtio/Kconfig           |  10 ++
> > > > >  drivers/virtio/Makefile          |   1 +
> > > > >  drivers/virtio/virtio_pmem.c     | 322
> > > > >  +++++++++++++++++++++++++++++++++++++++
> > > > >  include/uapi/linux/virtio_pmem.h |  55 +++++++
> > > > >  4 files changed, 388 insertions(+)
> > > > >  create mode 100644 drivers/virtio/virtio_pmem.c
> > > > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > > > 
> > > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > > index cff773f15b7e..0192c4bda54b 100644
> > > > > --- a/drivers/virtio/Kconfig
> > > > > +++ b/drivers/virtio/Kconfig
> > > > > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
> > > > > 
> > > > >           If unsure, say Y.
> > > > > 
> > > > > +config VIRTIO_PMEM
> > > > > +       tristate "Virtio pmem driver"
> > > > > +       depends on VIRTIO
> > > > > +       ---help---
> > > > > +        This driver adds persistent memory range within a
> > > > > KVM guest.

With "Virtio Block Backed Pmem" we could name the config
option VIRTIO_BLOCK_PMEM

The documentation text could make it clear to people that the
image shows up as a disk image on the host, but as a pmem
memory range in the guest.

> > > > I think we need to call this something other than persistent
> > > > memory to
> > > > make it clear that this not memory where the persistence can be
> > > > managed from userspace. The persistence point always requires
> > > > 
> > So currently /proc/iomem in a guest with a pmem device attached to
> > a
> > namespace looks like this:
> > 
> >     c00000000-13bfffffff : Persistent Memory
> >        c00000000-13bfffffff : namespace2.0
> > 
> > Can we call it "Virtio Shared Memory" to make it clear it is a
> > different beast than typical "Persistent Memory"?  You can likely
> 
> I think somewhere we need persistent keyword 'Virtio Persistent
> Memory' or 
> so.

Still hoping for better ideas than "Virtio Block Backed Pmem" :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 22:27               ` Rik van Riel
@ 2017-10-12 22:39                 ` Pankaj Gupta
  2017-10-12 22:52                 ` Pankaj Gupta
  1 sibling, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 22:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dan Williams, linux-kernel, KVM list, Qemu Developers,
	linux-nvdimm, Linux MM, Jan Kara, Stefan Hajnoczi,
	Haozhong Zhang, Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini,
	Ross Zwisler, David Hildenbrand, Xiao Guangrong

> > > > 
> > > > > >   This patch adds virtio-pmem driver for KVM guest.
> > > > > >   Guest reads the persistent memory range information
> > > > > >   over virtio bus from Qemu and reserves the range
> > > > > >   as persistent memory. Guest also allocates a block
> > > > > >   device corresponding to the pmem range which later
> > > > > >   can be accessed with DAX compatible file systems.
> > > > > >   Idea is to use the virtio channel between guest and
> > > > > >   host to perform the block device flush for guest pmem
> > > > > >   DAX device.
> > > > > > 
> > > > > >   There is work to do including DAX file system support
> > > > > >   and other advanced features.
> > > > > > 
> > > > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > > > > ---
> > > > > >  drivers/virtio/Kconfig           |  10 ++
> > > > > >  drivers/virtio/Makefile          |   1 +
> > > > > >  drivers/virtio/virtio_pmem.c     | 322
> > > > > >  +++++++++++++++++++++++++++++++++++++++
> > > > > >  include/uapi/linux/virtio_pmem.h |  55 +++++++
> > > > > >  4 files changed, 388 insertions(+)
> > > > > >  create mode 100644 drivers/virtio/virtio_pmem.c
> > > > > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > > > > 
> > > > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > > > index cff773f15b7e..0192c4bda54b 100644
> > > > > > --- a/drivers/virtio/Kconfig
> > > > > > +++ b/drivers/virtio/Kconfig
> > > > > > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
> > > > > > 
> > > > > >           If unsure, say Y.
> > > > > > 
> > > > > > +config VIRTIO_PMEM
> > > > > > +       tristate "Virtio pmem driver"
> > > > > > +       depends on VIRTIO
> > > > > > +       ---help---
> > > > > > +        This driver adds persistent memory range within a
> > > > > > KVM guest.
> 
> With "Virtio Block Backed Pmem" we could name the config
> option VIRTIO_BLOCK_PMEM
> 
> The documentation text could make it clear to people that the
> image shows up as a disk image on the host, but as a pmem
> memory range in the guest.

yes, this looks better. 
thank you.

> 
> > > > > I think we need to call this something other than persistent
> > > > > memory to
> > > > > make it clear that this not memory where the persistence can be
> > > > > managed from userspace. The persistence point always requires
> > > > > 
> > > So currently /proc/iomem in a guest with a pmem device attached to
> > > a
> > > namespace looks like this:
> > > 
> > >     c00000000-13bfffffff : Persistent Memory
> > >        c00000000-13bfffffff : namespace2.0
> > > 
> > > Can we call it "Virtio Shared Memory" to make it clear it is a
> > > different beast than typical "Persistent Memory"?  You can likely
> > 
> > I think somewhere we need persistent keyword 'Virtio Persistent
> > Memory' or
> > so.
> 
> Still hoping for better ideas than "Virtio Block Backed Pmem" :)

:-)
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 22:27               ` Rik van Riel
  2017-10-12 22:39                 ` Pankaj Gupta
@ 2017-10-12 22:52                 ` Pankaj Gupta
  2017-10-12 22:59                   ` Dan Williams
  1 sibling, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 22:52 UTC (permalink / raw)
  To: Dan Williams, Rik van Riel
  Cc: linux-kernel, KVM list, Qemu Developers, linux-nvdimm, Linux MM,
	Jan Kara, Stefan Hajnoczi, Haozhong Zhang, Nitesh Narayan Lal,
	Kevin Wolf, Paolo Bonzini, Ross Zwisler, David Hildenbrand,
	Xiao Guangrong


> > > wrote:
> > > > 
> > > > > >   This patch adds virtio-pmem driver for KVM guest.
> > > > > >   Guest reads the persistent memory range information
> > > > > >   over virtio bus from Qemu and reserves the range
> > > > > >   as persistent memory. Guest also allocates a block
> > > > > >   device corresponding to the pmem range which later
> > > > > >   can be accessed with DAX compatible file systems.
> > > > > >   Idea is to use the virtio channel between guest and
> > > > > >   host to perform the block device flush for guest pmem
> > > > > >   DAX device.
> > > > > > 
> > > > > >   There is work to do including DAX file system support
> > > > > >   and other advanced features.
> > > > > > 
> > > > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > > > > ---
> > > > > >  drivers/virtio/Kconfig           |  10 ++
> > > > > >  drivers/virtio/Makefile          |   1 +
> > > > > >  drivers/virtio/virtio_pmem.c     | 322
> > > > > >  +++++++++++++++++++++++++++++++++++++++
> > > > > >  include/uapi/linux/virtio_pmem.h |  55 +++++++
> > > > > >  4 files changed, 388 insertions(+)
> > > > > >  create mode 100644 drivers/virtio/virtio_pmem.c
> > > > > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > > > > 
> > > > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > > > index cff773f15b7e..0192c4bda54b 100644
> > > > > > --- a/drivers/virtio/Kconfig
> > > > > > +++ b/drivers/virtio/Kconfig
> > > > > > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
> > > > > > 
> > > > > >           If unsure, say Y.
> > > > > > 
> > > > > > +config VIRTIO_PMEM
> > > > > > +       tristate "Virtio pmem driver"
> > > > > > +       depends on VIRTIO
> > > > > > +       ---help---
> > > > > > +        This driver adds persistent memory range within a
> > > > > > KVM guest.
> 
> With "Virtio Block Backed Pmem" we could name the config
> option VIRTIO_BLOCK_PMEM
> 
> The documentation text could make it clear to people that the
> image shows up as a disk image on the host, but as a pmem
> memory range in the guest.
> 
> > > > > I think we need to call this something other than persistent
> > > > > memory to
> > > > > make it clear that this not memory where the persistence can be
> > > > > managed from userspace. The persistence point always requires
> > > > > 
> > > So currently /proc/iomem in a guest with a pmem device attached to
> > > a
> > > namespace looks like this:
> > > 
> > >     c00000000-13bfffffff : Persistent Memory
> > >        c00000000-13bfffffff : namespace2.0
> > > 
> > > Can we call it "Virtio Shared Memory" to make it clear it is a
> > > different beast than typical "Persistent Memory"?  You can likely
> > 
> > I think somewhere we need persistent keyword 'Virtio Persistent
> > Memory' or
> > so.
> 
> Still hoping for better ideas than "Virtio Block Backed Pmem" :)
> 

Dan,

I have a query regarding below patch [*]. My assumption is its halted
because of memory hotplug restructuring work? Anything I am missing 
here?

[*] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 22:52                 ` Pankaj Gupta
@ 2017-10-12 22:59                   ` Dan Williams
  2017-10-12 23:07                     ` Pankaj Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2017-10-12 22:59 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Rik van Riel, linux-kernel, KVM list, Qemu Developers,
	linux-nvdimm, Linux MM, Jan Kara, Stefan Hajnoczi,
	Haozhong Zhang, Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini,
	Ross Zwisler, David Hildenbrand, Xiao Guangrong

On Thu, Oct 12, 2017 at 3:52 PM, Pankaj Gupta <pagupta@redhat.com> wrote:
> Dan,
>
> I have a query regarding below patch [*]. My assumption is its halted
> because of memory hotplug restructuring work? Anything I am missing
> here?
>
> [*] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html

It's fallen to the back of my queue since the original driving need of
platform firmware changing offsets from boot-to-boot is no longer an
issue. However, it does mean that you need to arrange for 128MB
aligned devm_memremap_pages() ranges for the foreseeable future.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 22:59                   ` Dan Williams
@ 2017-10-12 23:07                     ` Pankaj Gupta
  0 siblings, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-12 23:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rik van Riel, linux-kernel, KVM list, Qemu Developers,
	linux-nvdimm, Linux MM, Jan Kara, Stefan Hajnoczi,
	Haozhong Zhang, Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini,
	Ross Zwisler, David Hildenbrand, Xiao Guangrong


> > Dan,
> >
> > I have a query regarding below patch [*]. My assumption is its halted
> > because of memory hotplug restructuring work? Anything I am missing
> > here?
> >
> > [*] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html
> 
> It's fallen to the back of my queue since the original driving need of
> platform firmware changing offsets from boot-to-boot is no longer an
> issue. However, it does mean that you need to arrange for 128MB
> aligned devm_memremap_pages() ranges for the foreseeable future.

o.k I will provide 128MB aligned pages to devm_memremap_pages() function.

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-12 15:50   ` [RFC 2/2] KVM: add virtio-pmem driver Pankaj Gupta
  2017-10-12 20:51     ` Dan Williams
@ 2017-10-13  9:44     ` Stefan Hajnoczi
  2017-10-13 10:48       ` Pankaj Gupta
       [not found]       ` <20171013094431.GA27308-lxVrvc10SDRcolVlb+j0YCZi+YwRKgec@public.gmane.org>
       [not found]     ` <20171012155027.3277-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 2 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2017-10-13  9:44 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm, jack,
	stefanha, dan.j.williams, riel, haozhong.zhang, nilal, kwolf,
	pbonzini, ross.zwisler, david, xiaoguangrong.eric

On Thu, Oct 12, 2017 at 09:20:26PM +0530, Pankaj Gupta wrote:
>   This patch adds virtio-pmem driver for KVM guest.
>   Guest reads the persistent memory range information
>   over virtio bus from Qemu and reserves the range
>   as persistent memory. Guest also allocates a block
>   device corresponding to the pmem range which later
>   can be accessed with DAX compatible file systems.
>   Idea is to use the virtio channel between guest and
>   host to perform the block device flush for guest pmem 
>   DAX device.
> 
>   There is work to do including DAX file system support 
>   and other advanced features.
> 
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/virtio/Kconfig           |  10 ++
>  drivers/virtio/Makefile          |   1 +
>  drivers/virtio/virtio_pmem.c     | 322 +++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_pmem.h |  55 +++++++
>  4 files changed, 388 insertions(+)
>  create mode 100644 drivers/virtio/virtio_pmem.c
>  create mode 100644 include/uapi/linux/virtio_pmem.h
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index cff773f15b7e..0192c4bda54b 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
>  
>  	  If unsure, say Y.
>  
> +config VIRTIO_PMEM
> +	tristate "Virtio pmem driver"
> +	depends on VIRTIO
> +	---help---
> +	 This driver adds persistent memory range within a KVM guest.
> +         It also associates a block device corresponding to the pmem
> +	 range.
> +
> +	 If unsure, say M.
> +
>  config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 41e30e3dc842..032ade725cc2 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -5,3 +5,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 000000000000..74e47cae0e24
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,322 @@
> +/*
> + * virtio-pmem driver
> + */
> +
> +#include <linux/virtio.h>
> +#include <linux/swap.h>
> +#include <linux/workqueue.h>
> +#include <linux/delay.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/oom.h>
> +#include <linux/wait.h>
> +#include <linux/mm.h>
> +#include <linux/mount.h>
> +#include <linux/magic.h>
> +#include <linux/virtio_pmem.h>
> +
> +void devm_vpmem_disable(struct device *dev, struct resource *res, void *addr)
> +{
> +	devm_memunmap(dev, addr);
> +	devm_release_mem_region(dev, res->start, resource_size(res));
> +}
> +
> +static void pmem_flush_done(struct virtqueue *vq)
> +{
> +	return;
> +};
> +
> +static void virtio_pmem_release_queue(void *q)
> +{
> +	blk_cleanup_queue(q);
> +}
> +
> +static void virtio_pmem_freeze_queue(void *q)
> +{
> +	blk_freeze_queue_start(q);
> +}
> +
> +static void virtio_pmem_release_disk(void *__pmem)
> +{
> +	struct virtio_pmem *pmem = __pmem;
> +
> +	del_gendisk(pmem->disk);
> +	put_disk(pmem->disk);
> +}
> +
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> +	struct virtqueue *vq;
> +
> +	/* single vq */
> +	vq = virtio_find_single_vq(vpmem->vdev, pmem_flush_done, "flush_queue");
> +
> +	if (IS_ERR(vq))
> +		return PTR_ERR(vq);
> +
> +	return 0;
> +}
> +
> +static struct vmem_altmap *setup_pmem_pfn(struct virtio_pmem *vpmem,
> +			struct resource *res, struct vmem_altmap *altmap)
> +{
> +	u32 start_pad = 0, end_trunc = 0;
> +	resource_size_t start, size;
> +	unsigned long npfns;
> +	phys_addr_t offset;
> +
> +	size = resource_size(res);
> +	start = PHYS_SECTION_ALIGN_DOWN(res->start);
> +
> +	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> +		IORES_DESC_NONE) == REGION_MIXED) {
> +
> +		start = res->start;
> +		start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
> +	}
> +	start = res->start;
> +	size = PHYS_SECTION_ALIGN_UP(start + size) - start;
> +
> +	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> +		IORES_DESC_NONE) == REGION_MIXED) {
> +
> +		size = resource_size(res);
> +		end_trunc = start + size -
> +				PHYS_SECTION_ALIGN_DOWN(start + size);
> +	}
> +
> +	start += start_pad;
> +	size = resource_size(res);
> +	npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - SZ_8K)
> +						/ PAGE_SIZE);
> +
> +      /*
> +       * vmemmap_populate_hugepages() allocates the memmap array in
> +       * HPAGE_SIZE chunks.
> +       */
> +	offset = ALIGN(start + SZ_8K + 64 * npfns, HPAGE_SIZE) - start;
> +	vpmem->data_offset = offset;
> +
> +	struct vmem_altmap __altmap = {
> +		.base_pfn = init_altmap_base(start+start_pad),
> +		.reserve = init_altmap_reserve(start+start_pad),
> +	};
> +
> +	res->start += start_pad;
> +	res->end -= end_trunc;
> +	memcpy(altmap, &__altmap, sizeof(*altmap));
> +	altmap->free = PHYS_PFN(offset - SZ_8K);
> +	altmap->alloc = 0;

The __altmap part can be rewritten using compound literal syntax:

  *altmap = (struct vmem_altmap) {
      .base_pfn = init_altmap_base(start+start_pad),
      .reserve = init_altmap_reserve(start+start_pad),
      .free = PHYS_PFN(offset - SZ_8K),
  };

This eliminates the temporary variable, memcpy, and explicit alloc = 0
initialization.

> +
> +	return altmap;
> +}
> +
> +static blk_status_t pmem_do_bvec(struct virtio_pmem *pmem, struct page *page,
> +			unsigned int len, unsigned int off, bool is_write,
> +			sector_t sector)
> +{
> +	blk_status_t rc = BLK_STS_OK;
> +	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> +	void *pmem_addr = pmem->virt_addr + pmem_off;
> +
> +	if (!is_write) {
> +		rc = read_pmem(page, off, pmem_addr, len);
> +			flush_dcache_page(page);
> +	} else {
> +		flush_dcache_page(page);

What are the semantics of this device?  Why flush the dcache here if an
explicit flush command needs to be sent over the virtqueue?

> +		write_pmem(pmem_addr, page, off, len);
> +	}
> +
> +	return rc;
> +}
> +
> +static int vpmem_rw_page(struct block_device *bdev, sector_t sector,
> +		       struct page *page, bool is_write)
> +{
> +	struct virtio_pmem  *pmem = bdev->bd_queue->queuedata;
> +	blk_status_t rc;
> +
> +	rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
> +			  0, is_write, sector);
> +
> +	if (rc == 0)
> +		page_endio(page, is_write, 0);
> +
> +	return blk_status_to_errno(rc);
> +}
> +
> +#ifndef REQ_FLUSH
> +#define REQ_FLUSH REQ_PREFLUSH
> +#endif

Is this out-of-tree kernel module compatibility stuff that can be
removed?

> +
> +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
> +			struct bio *bio)
> +{
> +	blk_status_t rc = 0;
> +	struct bio_vec bvec;
> +	struct bvec_iter iter;
> +	struct virtio_pmem *pmem = q->queuedata;
> +
> +	if (bio->bi_opf & REQ_FLUSH)
> +		//todo host flush command

This detail is critical to the device design.  What is the plan?

> +
> +	bio_for_each_segment(bvec, bio, iter) {
> +		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> +				bvec.bv_offset, op_is_write(bio_op(bio)),
> +				iter.bi_sector);
> +		if (rc) {
> +			bio->bi_status = rc;
> +			break;
> +		}
> +	}
> +
> +	bio_endio(bio);
> +	return BLK_QC_T_NONE;
> +}
> +
> +static const struct block_device_operations pmem_fops = {
> +	.owner =		THIS_MODULE,
> +	.rw_page =		vpmem_rw_page,
> +	//.revalidate_disk =	nvdimm_revalidate_disk,
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +	struct virtio_pmem *vpmem;
> +	int err = 0;
> +	void *addr;
> +	struct resource *res, res_pfn;
> +	struct request_queue *q;
> +	struct vmem_altmap __altmap, *altmap = NULL;
> +	struct gendisk *disk;
> +	struct device *gendev;
> +	int nid = dev_to_node(&vdev->dev);
> +
> +	if (!vdev->config->get) {
> +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +			GFP_KERNEL);
> +
> +	if (!vpmem) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	dev_set_drvdata(&vdev->dev, vpmem);
> +
> +	vpmem->vdev = vdev;
> +	err = init_vq(vpmem);
> +	if (err)
> +		goto out;
> +
> +	if (!virtio_has_feature(vdev, VIRTIO_PMEM_PLUG)) {
> +		dev_err(&vdev->dev, "%s : pmem not supported\n",
> +			__func__);
> +		goto out;
> +	}

Why is VIRTIO_PMEM_PLUG optional for this new device type if it's
already required by the first version of the driver?

err is not set when the error path is taken.

> +
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			start, &vpmem->start);
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			size, &vpmem->size);

The struct resource start and size fields can vary between
architectures.  Virtio device configuration space layout must not vary
between architectures.  Please use u64.

> +
> +	res_pfn.start = vpmem->start;
> +	res_pfn.end   = vpmem->start + vpmem->size-1;
> +
> +	/* used for allocating memmap in the pmem device */
> +	altmap	      = setup_pmem_pfn(vpmem, &res_pfn, &__altmap);
> +
> +	res = devm_request_mem_region(&vdev->dev,
> +			res_pfn.start, resource_size(&res_pfn), "virtio-pmem");
> +
> +	if (!res) {
> +		dev_warn(&vdev->dev, "could not reserve region ");
> +		return -EBUSY;
> +	}
> +
> +	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(&vdev->dev));
> +
> +	if (!q)
> +		return -ENOMEM;
> +
> +	if (devm_add_action_or_reset(&vdev->dev,
> +				virtio_pmem_release_queue, q))
> +		return -ENOMEM;
> +
> +	vpmem->pfn_flags = PFN_DEV;
> +
> +	/* allocate memap in pmem device itself */
> +	if (IS_ENABLED(CONFIG_ZONE_DEVICE)) {
> +
> +		addr = devm_memremap_pages(&vdev->dev, res,
> +				&q->q_usage_counter, altmap);
> +		vpmem->pfn_flags |= PFN_MAP;
> +	} else
> +		addr = devm_memremap(&vdev->dev, vpmem->start,
> +				vpmem->size, ARCH_MEMREMAP_PMEM);
> +
> +        /*
> +         * At release time the queue must be frozen before
> +         * devm_memremap_pages is unwound
> +         */
> +	if (devm_add_action_or_reset(&vdev->dev,
> +				virtio_pmem_freeze_queue, q))
> +		return -ENOMEM;
> +
> +	if (IS_ERR(addr))
> +		return PTR_ERR(addr);
> +
> +	vpmem->virt_addr = addr;
> +	blk_queue_write_cache(q, 0, 0);
> +	blk_queue_make_request(q, virtio_pmem_make_request);
> +	blk_queue_physical_block_size(q, PAGE_SIZE);
> +	blk_queue_logical_block_size(q, 512);
> +	blk_queue_max_hw_sectors(q, UINT_MAX);
> +	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
> +	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
> +	q->queuedata = vpmem;
> +
> +	disk = alloc_disk_node(0, nid);
> +	if (!disk)
> +		return -ENOMEM;

The return statements in this function look suspicious.  There probably
needs to be a cleanup to avoid memory leaks.

> +	vpmem->disk = disk;
> +
> +	disk->fops                = &pmem_fops;
> +	disk->queue               = q;
> +	disk->flags               = GENHD_FL_EXT_DEVT;
> +	strcpy(disk->disk_name, "vpmem");
> +	set_capacity(disk, vpmem->size/512);
> +	gendev = disk_to_dev(disk);
> +
> +	virtio_device_ready(vdev);
> +	device_add_disk(&vdev->dev, disk);
> +
> +	if (devm_add_action_or_reset(&vdev->dev,
> +			virtio_pmem_release_disk, vpmem))
> +		return -ENOMEM;
> +
> +	revalidate_disk(disk);
> +	return 0;
> +out:
> +	vdev->config->del_vqs(vdev);
> +	return err;
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> +	.feature_table		= features,
> +	.feature_table_size	= ARRAY_SIZE(features),
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.probe			= virtio_pmem_probe,
> +	//.remove		= virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..ec0c728c79ba
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,55 @@
> +/*
> + * Virtio pmem Device
> + *
> + *
> + */
> +
> +#ifndef _LINUX_VIRTIO_PMEM_H
> +#define _LINUX_VIRTIO_PMEM_H
> +
> +#include <linux/types.h>
> +#include <linux/virtio_types.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +#include <linux/pfn_t.h>
> +#include <linux/fs.h>
> +#include <linux/blk-mq.h>
> +#include <linux/pmem_common.h>

include/uapi/ headers must compile when included from userspace
applications.  The header should define userspace APIs only.

If you want to define kernel-internal APIs, please add them to
include/linux/ or similar.

This also means that include/uapi/ headers cannot include
include/linux/pmem_common.h or other kernel headers outside #ifdef
__KERNEL__.

This header file should only contain struct virtio_pmem_config,
VIRTIO_PMEM_PLUG, and other virtio device definitions.

> +
> +bool pmem_should_map_pages(struct device *dev);
> +
> +#define VIRTIO_PMEM_PLUG 0
> +
> +struct virtio_pmem_config {
> +
> +uint64_t start;
> +uint64_t size;
> +uint64_t align;
> +uint64_t data_offset;
> +};
> +
> +struct virtio_pmem {
> +
> +	struct virtio_device *vdev;
> +	struct virtqueue *req_vq;
> +
> +	uint64_t start;
> +	uint64_t size;
> +	uint64_t align;
> +	uint64_t data_offset;
> +	u64 pfn_flags;
> +	void *virt_addr;
> +	struct gendisk *disk;
> +} __packed;
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static unsigned int features[] = {
> +	VIRTIO_PMEM_PLUG,
> +};
> +
> +#endif
> +
> -- 
> 2.13.0
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-13  9:44     ` Stefan Hajnoczi
@ 2017-10-13 10:48       ` Pankaj Gupta
       [not found]         ` <24301306.20068579.1507891695416.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found]       ` <20171013094431.GA27308-lxVrvc10SDRcolVlb+j0YCZi+YwRKgec@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-13 10:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm, jack,
	stefanha, dan j williams, riel, haozhong zhang, nilal, kwolf,
	pbonzini, ross zwisler, david, xiaoguangrong eric


> 
> On Thu, Oct 12, 2017 at 09:20:26PM +0530, Pankaj Gupta wrote:
> >   This patch adds virtio-pmem driver for KVM guest.
> >   Guest reads the persistent memory range information
> >   over virtio bus from Qemu and reserves the range
> >   as persistent memory. Guest also allocates a block
> >   device corresponding to the pmem range which later
> >   can be accessed with DAX compatible file systems.
> >   Idea is to use the virtio channel between guest and
> >   host to perform the block device flush for guest pmem
> >   DAX device.
> > 
> >   There is work to do including DAX file system support
> >   and other advanced features.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/virtio/Kconfig           |  10 ++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/virtio_pmem.c     | 322
> >  +++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_pmem.h |  55 +++++++
> >  4 files changed, 388 insertions(+)
> >  create mode 100644 drivers/virtio/virtio_pmem.c
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index cff773f15b7e..0192c4bda54b 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -38,6 +38,16 @@ config VIRTIO_PCI_LEGACY
> >  
> >  	  If unsure, say Y.
> >  
> > +config VIRTIO_PMEM
> > +	tristate "Virtio pmem driver"
> > +	depends on VIRTIO
> > +	---help---
> > +	 This driver adds persistent memory range within a KVM guest.
> > +         It also associates a block device corresponding to the pmem
> > +	 range.
> > +
> > +	 If unsure, say M.
> > +
> >  config VIRTIO_BALLOON
> >  	tristate "Virtio balloon driver"
> >  	depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 41e30e3dc842..032ade725cc2 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -5,3 +5,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..74e47cae0e24
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,322 @@
> > +/*
> > + * virtio-pmem driver
> > + */
> > +
> > +#include <linux/virtio.h>
> > +#include <linux/swap.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/delay.h>
> > +#include <linux/slab.h>
> > +#include <linux/module.h>
> > +#include <linux/oom.h>
> > +#include <linux/wait.h>
> > +#include <linux/mm.h>
> > +#include <linux/mount.h>
> > +#include <linux/magic.h>
> > +#include <linux/virtio_pmem.h>
> > +
> > +void devm_vpmem_disable(struct device *dev, struct resource *res, void
> > *addr)
> > +{
> > +	devm_memunmap(dev, addr);
> > +	devm_release_mem_region(dev, res->start, resource_size(res));
> > +}
> > +
> > +static void pmem_flush_done(struct virtqueue *vq)
> > +{
> > +	return;
> > +};
> > +
> > +static void virtio_pmem_release_queue(void *q)
> > +{
> > +	blk_cleanup_queue(q);
> > +}
> > +
> > +static void virtio_pmem_freeze_queue(void *q)
> > +{
> > +	blk_freeze_queue_start(q);
> > +}
> > +
> > +static void virtio_pmem_release_disk(void *__pmem)
> > +{
> > +	struct virtio_pmem *pmem = __pmem;
> > +
> > +	del_gendisk(pmem->disk);
> > +	put_disk(pmem->disk);
> > +}
> > +
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > +	struct virtqueue *vq;
> > +
> > +	/* single vq */
> > +	vq = virtio_find_single_vq(vpmem->vdev, pmem_flush_done, "flush_queue");
> > +
> > +	if (IS_ERR(vq))
> > +		return PTR_ERR(vq);
> > +
> > +	return 0;
> > +}
> > +
> > +static struct vmem_altmap *setup_pmem_pfn(struct virtio_pmem *vpmem,
> > +			struct resource *res, struct vmem_altmap *altmap)
> > +{
> > +	u32 start_pad = 0, end_trunc = 0;
> > +	resource_size_t start, size;
> > +	unsigned long npfns;
> > +	phys_addr_t offset;
> > +
> > +	size = resource_size(res);
> > +	start = PHYS_SECTION_ALIGN_DOWN(res->start);
> > +
> > +	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> > +		IORES_DESC_NONE) == REGION_MIXED) {
> > +
> > +		start = res->start;
> > +		start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
> > +	}
> > +	start = res->start;
> > +	size = PHYS_SECTION_ALIGN_UP(start + size) - start;
> > +
> > +	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
> > +		IORES_DESC_NONE) == REGION_MIXED) {
> > +
> > +		size = resource_size(res);
> > +		end_trunc = start + size -
> > +				PHYS_SECTION_ALIGN_DOWN(start + size);
> > +	}
> > +
> > +	start += start_pad;
> > +	size = resource_size(res);
> > +	npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - SZ_8K)
> > +						/ PAGE_SIZE);
> > +
> > +      /*
> > +       * vmemmap_populate_hugepages() allocates the memmap array in
> > +       * HPAGE_SIZE chunks.
> > +       */
> > +	offset = ALIGN(start + SZ_8K + 64 * npfns, HPAGE_SIZE) - start;
> > +	vpmem->data_offset = offset;
> > +
> > +	struct vmem_altmap __altmap = {
> > +		.base_pfn = init_altmap_base(start+start_pad),
> > +		.reserve = init_altmap_reserve(start+start_pad),
> > +	};
> > +
> > +	res->start += start_pad;
> > +	res->end -= end_trunc;
> > +	memcpy(altmap, &__altmap, sizeof(*altmap));
> > +	altmap->free = PHYS_PFN(offset - SZ_8K);
> > +	altmap->alloc = 0;
> 
> The __altmap part can be rewritten using compound literal syntax:
> 
>   *altmap = (struct vmem_altmap) {
>       .base_pfn = init_altmap_base(start+start_pad),
>       .reserve = init_altmap_reserve(start+start_pad),
>       .free = PHYS_PFN(offset - SZ_8K),
>   };
> 
> This eliminates the temporary variable, memcpy, and explicit alloc = 0
> initialization.

o.k will use it.
  
> 
> > +
> > +	return altmap;
> > +}
> > +
> > +static blk_status_t pmem_do_bvec(struct virtio_pmem *pmem, struct page
> > *page,
> > +			unsigned int len, unsigned int off, bool is_write,
> > +			sector_t sector)
> > +{
> > +	blk_status_t rc = BLK_STS_OK;
> > +	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> > +	void *pmem_addr = pmem->virt_addr + pmem_off;
> > +
> > +	if (!is_write) {
> > +		rc = read_pmem(page, off, pmem_addr, len);
> > +			flush_dcache_page(page);
> > +	} else {
> > +		flush_dcache_page(page);
> 
> What are the semantics of this device?  Why flush the dcache here if an
> explicit flush command needs to be sent over the virtqueue?

yes, we don't need it in this case. I yet have to think more
about flushing with block and file-system level support :)

> 
> > +		write_pmem(pmem_addr, page, off, len);
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> > +static int vpmem_rw_page(struct block_device *bdev, sector_t sector,
> > +		       struct page *page, bool is_write)
> > +{
> > +	struct virtio_pmem  *pmem = bdev->bd_queue->queuedata;
> > +	blk_status_t rc;
> > +
> > +	rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
> > +			  0, is_write, sector);
> > +
> > +	if (rc == 0)
> > +		page_endio(page, is_write, 0);
> > +
> > +	return blk_status_to_errno(rc);
> > +}
> > +
> > +#ifndef REQ_FLUSH
> > +#define REQ_FLUSH REQ_PREFLUSH
> > +#endif
> 
> Is this out-of-tree kernel module compatibility stuff that can be
> removed?

yes, will check what filesystem expects for flush if not used then
remove.

> 
> > +
> > +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
> > +			struct bio *bio)
> > +{
> > +	blk_status_t rc = 0;
> > +	struct bio_vec bvec;
> > +	struct bvec_iter iter;
> > +	struct virtio_pmem *pmem = q->queuedata;
> > +
> > +	if (bio->bi_opf & REQ_FLUSH)
> > +		//todo host flush command
> 
> This detail is critical to the device design.  What is the plan?

yes, this is good point.

was thinking of guest sending a flush command to Qemu which
will do a fsync on file fd.

If we do a async flush and move the task to wait queue till we receive 
flush complete reply from host we can allow other tasks to execute
in current cpu.

Any suggestions you have or anything I am not foreseeing here?

> 
> > +
> > +	bio_for_each_segment(bvec, bio, iter) {
> > +		rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> > +				bvec.bv_offset, op_is_write(bio_op(bio)),
> > +				iter.bi_sector);
> > +		if (rc) {
> > +			bio->bi_status = rc;
> > +			break;
> > +		}
> > +	}
> > +
> > +	bio_endio(bio);
> > +	return BLK_QC_T_NONE;
> > +}
> > +
> > +static const struct block_device_operations pmem_fops = {
> > +	.owner =		THIS_MODULE,
> > +	.rw_page =		vpmem_rw_page,
> > +	//.revalidate_disk =	nvdimm_revalidate_disk,
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +	struct virtio_pmem *vpmem;
> > +	int err = 0;
> > +	void *addr;
> > +	struct resource *res, res_pfn;
> > +	struct request_queue *q;
> > +	struct vmem_altmap __altmap, *altmap = NULL;
> > +	struct gendisk *disk;
> > +	struct device *gendev;
> > +	int nid = dev_to_node(&vdev->dev);
> > +
> > +	if (!vdev->config->get) {
> > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> > +			__func__);
> > +		return -EINVAL;
> > +	}
> > +
> > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +			GFP_KERNEL);
> > +
> > +	if (!vpmem) {
> > +		err = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	dev_set_drvdata(&vdev->dev, vpmem);
> > +
> > +	vpmem->vdev = vdev;
> > +	err = init_vq(vpmem);
> > +	if (err)
> > +		goto out;
> > +
> > +	if (!virtio_has_feature(vdev, VIRTIO_PMEM_PLUG)) {
> > +		dev_err(&vdev->dev, "%s : pmem not supported\n",
> > +			__func__);
> > +		goto out;
> > +	}
> 
> Why is VIRTIO_PMEM_PLUG optional for this new device type if it's
> already required by the first version of the driver?

I just added it, initially kept it a place holder for any feature bit use.
I will remove it if we are not using it.

> 
> err is not set when the error path is taken.
> 
> > +
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			start, &vpmem->start);
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			size, &vpmem->size);
> 
> The struct resource start and size fields can vary between
> architectures.  Virtio device configuration space layout must not vary
> between architectures.  Please use u64.

sure.

> 
> > +
> > +	res_pfn.start = vpmem->start;
> > +	res_pfn.end   = vpmem->start + vpmem->size-1;
> > +
> > +	/* used for allocating memmap in the pmem device */
> > +	altmap	      = setup_pmem_pfn(vpmem, &res_pfn, &__altmap);
> > +
> > +	res = devm_request_mem_region(&vdev->dev,
> > +			res_pfn.start, resource_size(&res_pfn), "virtio-pmem");
> > +
> > +	if (!res) {
> > +		dev_warn(&vdev->dev, "could not reserve region ");
> > +		return -EBUSY;
> > +	}
> > +
> > +	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(&vdev->dev));
> > +
> > +	if (!q)
> > +		return -ENOMEM;
> > +
> > +	if (devm_add_action_or_reset(&vdev->dev,
> > +				virtio_pmem_release_queue, q))
> > +		return -ENOMEM;
> > +
> > +	vpmem->pfn_flags = PFN_DEV;
> > +
> > +	/* allocate memap in pmem device itself */
> > +	if (IS_ENABLED(CONFIG_ZONE_DEVICE)) {
> > +
> > +		addr = devm_memremap_pages(&vdev->dev, res,
> > +				&q->q_usage_counter, altmap);
> > +		vpmem->pfn_flags |= PFN_MAP;
> > +	} else
> > +		addr = devm_memremap(&vdev->dev, vpmem->start,
> > +				vpmem->size, ARCH_MEMREMAP_PMEM);
> > +
> > +        /*
> > +         * At release time the queue must be frozen before
> > +         * devm_memremap_pages is unwound
> > +         */
> > +	if (devm_add_action_or_reset(&vdev->dev,
> > +				virtio_pmem_freeze_queue, q))
> > +		return -ENOMEM;
> > +
> > +	if (IS_ERR(addr))
> > +		return PTR_ERR(addr);
> > +
> > +	vpmem->virt_addr = addr;
> > +	blk_queue_write_cache(q, 0, 0);
> > +	blk_queue_make_request(q, virtio_pmem_make_request);
> > +	blk_queue_physical_block_size(q, PAGE_SIZE);
> > +	blk_queue_logical_block_size(q, 512);
> > +	blk_queue_max_hw_sectors(q, UINT_MAX);
> > +	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
> > +	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
> > +	q->queuedata = vpmem;
> > +
> > +	disk = alloc_disk_node(0, nid);
> > +	if (!disk)
> > +		return -ENOMEM;
> 
> The return statements in this function look suspicious.  There probably
> needs to be a cleanup to avoid memory leaks.

Sure! valid point.
> 
> > +	vpmem->disk = disk;
> > +
> > +	disk->fops                = &pmem_fops;
> > +	disk->queue               = q;
> > +	disk->flags               = GENHD_FL_EXT_DEVT;
> > +	strcpy(disk->disk_name, "vpmem");
> > +	set_capacity(disk, vpmem->size/512);
> > +	gendev = disk_to_dev(disk);
> > +
> > +	virtio_device_ready(vdev);
> > +	device_add_disk(&vdev->dev, disk);
> > +
> > +	if (devm_add_action_or_reset(&vdev->dev,
> > +			virtio_pmem_release_disk, vpmem))
> > +		return -ENOMEM;
> > +
> > +	revalidate_disk(disk);
> > +	return 0;
> > +out:
> > +	vdev->config->del_vqs(vdev);
> > +	return err;
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > +	.feature_table		= features,
> > +	.feature_table_size	= ARRAY_SIZE(features),
> > +	.driver.name		= KBUILD_MODNAME,
> > +	.driver.owner		= THIS_MODULE,
> > +	.id_table		= id_table,
> > +	.probe			= virtio_pmem_probe,
> > +	//.remove		= virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..ec0c728c79ba
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,55 @@
> > +/*
> > + * Virtio pmem Device
> > + *
> > + *
> > + */
> > +
> > +#ifndef _LINUX_VIRTIO_PMEM_H
> > +#define _LINUX_VIRTIO_PMEM_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/virtio_types.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +#include <linux/pfn_t.h>
> > +#include <linux/fs.h>
> > +#include <linux/blk-mq.h>
> > +#include <linux/pmem_common.h>
> 
> include/uapi/ headers must compile when included from userspace
> applications.  The header should define userspace APIs only.
> 
> If you want to define kernel-internal APIs, please add them to
> include/linux/ or similar.
> 
> This also means that include/uapi/ headers cannot include
> include/linux/pmem_common.h or other kernel headers outside #ifdef
> __KERNEL__.

o.k

> 
> This header file should only contain struct virtio_pmem_config,
> VIRTIO_PMEM_PLUG, and other virtio device definitions.

Sure!
> 
> > +
> > +bool pmem_should_map_pages(struct device *dev);
> > +
> > +#define VIRTIO_PMEM_PLUG 0
> > +
> > +struct virtio_pmem_config {
> > +
> > +uint64_t start;
> > +uint64_t size;
> > +uint64_t align;
> > +uint64_t data_offset;
> > +};
> > +
> > +struct virtio_pmem {
> > +
> > +	struct virtio_device *vdev;
> > +	struct virtqueue *req_vq;
> > +
> > +	uint64_t start;
> > +	uint64_t size;
> > +	uint64_t align;
> > +	uint64_t data_offset;
> > +	u64 pfn_flags;
> > +	void *virt_addr;
> > +	struct gendisk *disk;
> > +} __packed;
> > +
> > +static struct virtio_device_id id_table[] = {
> > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +	{ 0 },
> > +};
> > +
> > +static unsigned int features[] = {
> > +	VIRTIO_PMEM_PLUG,
> > +};
> > +
> > +#endif
> > +
> > --
> > 2.13.0
> > 
> 

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
       [not found]       ` <20171013094431.GA27308-lxVrvc10SDRcolVlb+j0YCZi+YwRKgec@public.gmane.org>
@ 2017-10-13 15:25         ` Dan Williams
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2017-10-13 15:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Pankaj Gupta, Rik van Riel, Jan Kara, Xiao Guangrong,
	KVM list, linux-nvdimm, David Hildenbrand, Zwisler, Ross,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Qemu Developers, Linux MM,
	Stefan Hajnoczi, Paolo Bonzini, Nitesh Narayan Lal

On Fri, Oct 13, 2017 at 2:44 AM, Stefan Hajnoczi <stefanha-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, Oct 12, 2017 at 09:20:26PM +0530, Pankaj Gupta wrote:
[..]
>> +#ifndef REQ_FLUSH
>> +#define REQ_FLUSH REQ_PREFLUSH
>> +#endif
>
> Is this out-of-tree kernel module compatibility stuff that can be
> removed?

Yes, this was copied from the pmem driver where it can also be
removed, it was used to workaround a merge order problem in linux-next
when these definitions were changed several kernel releases back.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
       [not found]         ` <24301306.20068579.1507891695416.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-16 14:47           ` Stefan Hajnoczi
  2017-10-16 15:58             ` Dan Williams
  2017-10-16 17:04             ` Pankaj Gupta
  0 siblings, 2 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2017-10-16 14:47 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
	jack-AlSwsSmVLrQ, xiaoguangrong eric, kvm-u79uwXL29TY76Z2rM5mHXA,
	david-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w, ross zwisler,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, stefanha-H+wXaHxf7aLQT0dZR+AlfA,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, nilal-H+wXaHxf7aLQT0dZR+AlfA

On Fri, Oct 13, 2017 at 06:48:15AM -0400, Pankaj Gupta wrote:
> > On Thu, Oct 12, 2017 at 09:20:26PM +0530, Pankaj Gupta wrote:
> > > +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
> > > +			struct bio *bio)
> > > +{
> > > +	blk_status_t rc = 0;
> > > +	struct bio_vec bvec;
> > > +	struct bvec_iter iter;
> > > +	struct virtio_pmem *pmem = q->queuedata;
> > > +
> > > +	if (bio->bi_opf & REQ_FLUSH)
> > > +		//todo host flush command
> > 
> > This detail is critical to the device design.  What is the plan?
> 
> yes, this is good point.
> 
> was thinking of guest sending a flush command to Qemu which
> will do a fsync on file fd.

Previously there was discussion about fsyncing a specific file range
instead of the whole file.  This could perform better in cases where
only a subset of dirty pages need to be flushed.

One possibility is to design the virtio interface to communicate ranges
but the emulation code simply fsyncs the fd for the time being.  Later
on, if the necessary kernel and userspace interfaces are added, we can
make use of the interface.

> If we do a async flush and move the task to wait queue till we receive 
> flush complete reply from host we can allow other tasks to execute
> in current cpu.
> 
> Any suggestions you have or anything I am not foreseeing here?

My main thought about this patch series is whether pmem should be a
virtio-blk feature bit instead of a whole new device.  There is quite a
bit of overlap between the two.

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-16 14:47           ` Stefan Hajnoczi
@ 2017-10-16 15:58             ` Dan Williams
  2017-10-16 17:04             ` Pankaj Gupta
  1 sibling, 0 replies; 34+ messages in thread
From: Dan Williams @ 2017-10-16 15:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Pankaj Gupta, linux-kernel, KVM list, Qemu Developers,
	linux-nvdimm, Linux MM, Jan Kara, Stefan Hajnoczi, Rik van Riel,
	haozhong zhang, Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini,
	ross zwisler, David Hildenbrand, xiaoguangrong eric

On Mon, Oct 16, 2017 at 7:47 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Oct 13, 2017 at 06:48:15AM -0400, Pankaj Gupta wrote:
>> > On Thu, Oct 12, 2017 at 09:20:26PM +0530, Pankaj Gupta wrote:
>> > > +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
>> > > +                 struct bio *bio)
>> > > +{
>> > > + blk_status_t rc = 0;
>> > > + struct bio_vec bvec;
>> > > + struct bvec_iter iter;
>> > > + struct virtio_pmem *pmem = q->queuedata;
>> > > +
>> > > + if (bio->bi_opf & REQ_FLUSH)
>> > > +         //todo host flush command
>> >
>> > This detail is critical to the device design.  What is the plan?
>>
>> yes, this is good point.
>>
>> was thinking of guest sending a flush command to Qemu which
>> will do a fsync on file fd.
>
> Previously there was discussion about fsyncing a specific file range
> instead of the whole file.  This could perform better in cases where
> only a subset of dirty pages need to be flushed.
>
> One possibility is to design the virtio interface to communicate ranges
> but the emulation code simply fsyncs the fd for the time being.  Later
> on, if the necessary kernel and userspace interfaces are added, we can
> make use of the interface.

Range based is not a natural storage cache management mechanism. All
that is it available typically is a full write-cache-flush mechanism
and upper layers would need to customized for range-based flushing.

>> If we do a async flush and move the task to wait queue till we receive
>> flush complete reply from host we can allow other tasks to execute
>> in current cpu.
>>
>> Any suggestions you have or anything I am not foreseeing here?
>
> My main thought about this patch series is whether pmem should be a
> virtio-blk feature bit instead of a whole new device.  There is quite a
> bit of overlap between the two.

I'd be open to that... there's already provisions in the pmem driver
for platforms where cpu caches are flushed on power-loss, a virtio
mode for this shared-memory case seems reasonable.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-16 14:47           ` Stefan Hajnoczi
  2017-10-16 15:58             ` Dan Williams
@ 2017-10-16 17:04             ` Pankaj Gupta
  1 sibling, 0 replies; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-16 17:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm, jack,
	stefanha, dan j williams, riel, haozhong zhang, nilal, kwolf,
	pbonzini, ross zwisler, david, xiaoguangrong eric


> 
> On Fri, Oct 13, 2017 at 06:48:15AM -0400, Pankaj Gupta wrote:
> > > On Thu, Oct 12, 2017 at 09:20:26PM +0530, Pankaj Gupta wrote:
> > > > +static blk_qc_t virtio_pmem_make_request(struct request_queue *q,
> > > > +			struct bio *bio)
> > > > +{
> > > > +	blk_status_t rc = 0;
> > > > +	struct bio_vec bvec;
> > > > +	struct bvec_iter iter;
> > > > +	struct virtio_pmem *pmem = q->queuedata;
> > > > +
> > > > +	if (bio->bi_opf & REQ_FLUSH)
> > > > +		//todo host flush command
> > > 
> > > This detail is critical to the device design.  What is the plan?
> > 
> > yes, this is good point.
> > 
> > was thinking of guest sending a flush command to Qemu which
> > will do a fsync on file fd.
> 
> Previously there was discussion about fsyncing a specific file range
> instead of the whole file.  This could perform better in cases where
> only a subset of dirty pages need to be flushed.

yes, We had discussion about this and decided to do entire block flush
then to range level flush.

> 
> One possibility is to design the virtio interface to communicate ranges
> but the emulation code simply fsyncs the fd for the time being.  Later
> on, if the necessary kernel and userspace interfaces are added, we can
> make use of the interface.
> 
> > If we do a async flush and move the task to wait queue till we receive
> > flush complete reply from host we can allow other tasks to execute
> > in current cpu.
> > 
> > Any suggestions you have or anything I am not foreseeing here?
> 
> My main thought about this patch series is whether pmem should be a
> virtio-blk feature bit instead of a whole new device.  There is quite a
> bit of overlap between the two.

Exposing options with existing virtio-blk device to be used as persistent memory
range at high level would require additional below features:

- Use a persistent memory range with an option to allocate memmap array in the device
  itself for .

- Block operations for DAX and persistent memory range.

- Bifurcation at filesystem level based on type of virtio-blk device selected.

- Bifurcation of flushing interface and communication channel between guest & host.

But yes these features can be dynamically configured based on type of device
added? What if we have virtio-blk:virtio-pmem (m:n) devices ratio?And scale involved? 

If i understand correctly virtio-blk is high performance interface with multiqueue support 
and additional features at host side like data-plane mode etc. If we bloat it with additional
stuff(even when we need them) and provide locking with additional features both at guest as 
well as host side we will get a hit in performance? Also as requirement of both the interfaces
would grow it will be more difficult to maintain? I would prefer more simpler interfaces with
defined functionality but yes common code can be shared and used using well defined wrappers. 

> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC 2/2] KVM: add virtio-pmem driver
       [not found]     ` <20171012155027.3277-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-10-17  7:16       ` Christoph Hellwig
  2017-10-17  7:40         ` [Qemu-devel] " Pankaj Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2017-10-17  7:16 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
	jack-AlSwsSmVLrQ, xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w,
	kvm-u79uwXL29TY76Z2rM5mHXA, david-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	ross.zwisler-ral2JQCrhuEAvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, stefanha-H+wXaHxf7aLQT0dZR+AlfA,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, nilal-H+wXaHxf7aLQT0dZR+AlfA

I think this driver is at entirely the wrong level.

If you want to expose pmem to a guest with flushing assist do it
as pmem, and not a block driver.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-17  7:16       ` Christoph Hellwig
@ 2017-10-17  7:40         ` Pankaj Gupta
  2017-10-17  8:02           ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-17  7:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: kwolf, haozhong zhang, jack, xiaoguangrong eric, kvm, david,
	linux-nvdimm, ross zwisler, linux-kernel, qemu-devel, linux-mm,
	stefanha, pbonzini, dan j williams, nilal


> 
> I think this driver is at entirely the wrong level.
> 
> If you want to expose pmem to a guest with flushing assist do it
> as pmem, and not a block driver.

Are you saying do it as existing i.e ACPI pmem like interface?
The reason we have created this new driver is exiting pmem driver
does not define proper semantics for guest flushing requests.

Regarding block support of driver, we want to achieve DAX support
to bypass guest page cache. Also, we want to utilize existing DAX
capable file-system interfaces(e.g fsync) from userspace file API's
to trigger the host side flush request.

Below link has details of previous discussion:
https://marc.info/?l=kvm&m=150091133700361&w=2

Thanks,
Pankaj  

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-17  7:40         ` [Qemu-devel] " Pankaj Gupta
@ 2017-10-17  8:02           ` Christoph Hellwig
  2017-10-17  8:30             ` Pankaj Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2017-10-17  8:02 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Christoph Hellwig, kwolf, haozhong zhang, jack,
	xiaoguangrong eric, kvm, david, linux-nvdimm, ross zwisler,
	linux-kernel, qemu-devel, linux-mm, stefanha, pbonzini,
	dan j williams, nilal

On Tue, Oct 17, 2017 at 03:40:56AM -0400, Pankaj Gupta wrote:
> Are you saying do it as existing i.e ACPI pmem like interface?
> The reason we have created this new driver is exiting pmem driver
> does not define proper semantics for guest flushing requests.

At this point I'm caring about the Linux-internal interface, and
for that it should be integrated into the nvdimm subsystem and not
a block driver.  How the host <-> guest interface looks is a different
idea.

> 
> Regarding block support of driver, we want to achieve DAX support
> to bypass guest page cache. Also, we want to utilize existing DAX
> capable file-system interfaces(e.g fsync) from userspace file API's
> to trigger the host side flush request.

Well, if you want to support XFS+DAX better don't make it a block
devices, because I'll post patches soon to stop using the block device
entirely for the DAX case.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-17  8:02           ` Christoph Hellwig
@ 2017-10-17  8:30             ` Pankaj Gupta
  2017-10-18 13:03               ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Pankaj Gupta @ 2017-10-17  8:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: kwolf, haozhong zhang, jack, xiaoguangrong eric, kvm, david,
	linux-nvdimm, ross zwisler, linux-kernel, qemu-devel, linux-mm,
	stefanha, pbonzini, dan j williams, nilal


> > Are you saying do it as existing i.e ACPI pmem like interface?
> > The reason we have created this new driver is exiting pmem driver
> > does not define proper semantics for guest flushing requests.
> 
> At this point I'm caring about the Linux-internal interface, and
> for that it should be integrated into the nvdimm subsystem and not
> a block driver.  How the host <-> guest interface looks is a different
> idea.
> 
> > 
> > Regarding block support of driver, we want to achieve DAX support
> > to bypass guest page cache. Also, we want to utilize existing DAX
> > capable file-system interfaces(e.g fsync) from userspace file API's
> > to trigger the host side flush request.
> 
> Well, if you want to support XFS+DAX better don't make it a block
> devices, because I'll post patches soon to stop using the block device
> entirely for the DAX case.

o.k I will look at your patches once they are in mailing list.
Thanks for the heads up.

If I am guessing it right, we don't need block device additional features
for pmem? We can bypass block device features like blk device cache flush etc.
Also, still we would be supporting ext4 & XFS filesystem with pmem?

If there is time to your patches can you please elaborate on this a bit.

Thanks,
Pankaj

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-17  8:30             ` Pankaj Gupta
@ 2017-10-18 13:03               ` Stefan Hajnoczi
  2017-10-18 15:51                 ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2017-10-18 13:03 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Christoph Hellwig, kwolf, haozhong zhang, jack,
	xiaoguangrong eric, kvm, david, linux-nvdimm, ross zwisler,
	linux-kernel, qemu-devel, linux-mm, stefanha, pbonzini,
	dan j williams, nilal

On Tue, Oct 17, 2017 at 04:30:41AM -0400, Pankaj Gupta wrote:
> 
> > > Are you saying do it as existing i.e ACPI pmem like interface?
> > > The reason we have created this new driver is exiting pmem driver
> > > does not define proper semantics for guest flushing requests.
> > 
> > At this point I'm caring about the Linux-internal interface, and
> > for that it should be integrated into the nvdimm subsystem and not
> > a block driver.  How the host <-> guest interface looks is a different
> > idea.
> > 
> > > 
> > > Regarding block support of driver, we want to achieve DAX support
> > > to bypass guest page cache. Also, we want to utilize existing DAX
> > > capable file-system interfaces(e.g fsync) from userspace file API's
> > > to trigger the host side flush request.
> > 
> > Well, if you want to support XFS+DAX better don't make it a block
> > devices, because I'll post patches soon to stop using the block device
> > entirely for the DAX case.
> 
> o.k I will look at your patches once they are in mailing list.
> Thanks for the heads up.
> 
> If I am guessing it right, we don't need block device additional features
> for pmem? We can bypass block device features like blk device cache flush etc.
> Also, still we would be supporting ext4 & XFS filesystem with pmem?
> 
> If there is time to your patches can you please elaborate on this a bit.

I think the idea is that the nvdimm subsystem already adds block device
semantics on top of the struct nvdimms that it manages.  See
drivers/nvdimm/blk.c.

So it would be cleaner to make virtio-pmem an nvdimm bus.  This will
eliminate the duplication between your driver and drivers/nvdimm/ code.
Try "git grep nvdimm_bus_register" to find drivers that use the nvdimm
subsystem.

The virtio-pmem driver could register itself similar to the existing
ACPI NFIT and BIOS e820 pmem drivers.

Stefan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-18 13:03               ` Stefan Hajnoczi
@ 2017-10-18 15:51                 ` Dan Williams
       [not found]                   ` <CAPcyv4h6aFkyHhh4R4DTznbSCLf9CuBoszk0Q1gB5EKNcp_SeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-10-19  8:01                   ` Christoph Hellwig
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Williams @ 2017-10-18 15:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Pankaj Gupta, Christoph Hellwig, Kevin Wolf, haozhong zhang,
	Jan Kara, xiaoguangrong eric, KVM list, David Hildenbrand,
	linux-nvdimm, ross zwisler, linux-kernel, Qemu Developers,
	Linux MM, Stefan Hajnoczi, Paolo Bonzini, Nitesh Narayan Lal

On Wed, Oct 18, 2017 at 6:03 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Oct 17, 2017 at 04:30:41AM -0400, Pankaj Gupta wrote:
>>
>> > > Are you saying do it as existing i.e ACPI pmem like interface?
>> > > The reason we have created this new driver is exiting pmem driver
>> > > does not define proper semantics for guest flushing requests.
>> >
>> > At this point I'm caring about the Linux-internal interface, and
>> > for that it should be integrated into the nvdimm subsystem and not
>> > a block driver.  How the host <-> guest interface looks is a different
>> > idea.
>> >
>> > >
>> > > Regarding block support of driver, we want to achieve DAX support
>> > > to bypass guest page cache. Also, we want to utilize existing DAX
>> > > capable file-system interfaces(e.g fsync) from userspace file API's
>> > > to trigger the host side flush request.
>> >
>> > Well, if you want to support XFS+DAX better don't make it a block
>> > devices, because I'll post patches soon to stop using the block device
>> > entirely for the DAX case.
>>
>> o.k I will look at your patches once they are in mailing list.
>> Thanks for the heads up.
>>
>> If I am guessing it right, we don't need block device additional features
>> for pmem? We can bypass block device features like blk device cache flush etc.
>> Also, still we would be supporting ext4 & XFS filesystem with pmem?
>>
>> If there is time to your patches can you please elaborate on this a bit.
>
> I think the idea is that the nvdimm subsystem already adds block device
> semantics on top of the struct nvdimms that it manages.  See
> drivers/nvdimm/blk.c.
>
> So it would be cleaner to make virtio-pmem an nvdimm bus.  This will
> eliminate the duplication between your driver and drivers/nvdimm/ code.
> Try "git grep nvdimm_bus_register" to find drivers that use the nvdimm
> subsystem.

This use case is not "Persistent Memory". Persistent Memory is
something you can map and make persistent with CPU instructions.
Anything that requires a driver call is device driver managed "Shared
Memory".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
       [not found]                   ` <CAPcyv4h6aFkyHhh4R4DTznbSCLf9CuBoszk0Q1gB5EKNcp_SeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-19  8:01                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2017-10-19  8:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Pankaj Gupta, Jan Kara, xiaoguangrong eric, KVM list,
	linux-nvdimm, David Hildenbrand, ross zwisler,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Qemu Developers,
	Christoph Hellwig, Linux MM, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal

On Wed, Oct 18, 2017 at 08:51:37AM -0700, Dan Williams wrote:
> On Wed, Oct 18, 2017 at 6:03 AM, Stefan Hajnoczi <stefanha-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Tue, Oct 17, 2017 at 04:30:41AM -0400, Pankaj Gupta wrote:
> >>
> >> > > Are you saying do it as existing i.e ACPI pmem like interface?
> >> > > The reason we have created this new driver is exiting pmem driver
> >> > > does not define proper semantics for guest flushing requests.
> >> >
> >> > At this point I'm caring about the Linux-internal interface, and
> >> > for that it should be integrated into the nvdimm subsystem and not
> >> > a block driver.  How the host <-> guest interface looks is a different
> >> > idea.
> >> >
> >> > >
> >> > > Regarding block support of driver, we want to achieve DAX support
> >> > > to bypass guest page cache. Also, we want to utilize existing DAX
> >> > > capable file-system interfaces(e.g fsync) from userspace file API's
> >> > > to trigger the host side flush request.
> >> >
> >> > Well, if you want to support XFS+DAX better don't make it a block
> >> > devices, because I'll post patches soon to stop using the block device
> >> > entirely for the DAX case.
> >>
> >> o.k I will look at your patches once they are in mailing list.
> >> Thanks for the heads up.
> >>
> >> If I am guessing it right, we don't need block device additional features
> >> for pmem? We can bypass block device features like blk device cache flush etc.
> >> Also, still we would be supporting ext4 & XFS filesystem with pmem?
> >>
> >> If there is time to your patches can you please elaborate on this a bit.
> >
> > I think the idea is that the nvdimm subsystem already adds block device
> > semantics on top of the struct nvdimms that it manages.  See
> > drivers/nvdimm/blk.c.
> >
> > So it would be cleaner to make virtio-pmem an nvdimm bus.  This will
> > eliminate the duplication between your driver and drivers/nvdimm/ code.
> > Try "git grep nvdimm_bus_register" to find drivers that use the nvdimm
> > subsystem.
> 
> This use case is not "Persistent Memory". Persistent Memory is
> something you can map and make persistent with CPU instructions.
> Anything that requires a driver call is device driver managed "Shared
> Memory".

Dan, in that case do you have ideas regarding Christoph Hellwig's
comment that this driver should be integrated into the nvdimm subsystem
instead of a new block driver?

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-18 15:51                 ` Dan Williams
       [not found]                   ` <CAPcyv4h6aFkyHhh4R4DTznbSCLf9CuBoszk0Q1gB5EKNcp_SeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-19  8:01                   ` Christoph Hellwig
       [not found]                     ` <20171019080149.GB10089-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2017-10-19  8:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Stefan Hajnoczi, Pankaj Gupta, Christoph Hellwig, Kevin Wolf,
	haozhong zhang, Jan Kara, xiaoguangrong eric, KVM list,
	David Hildenbrand, linux-nvdimm, ross zwisler, linux-kernel,
	Qemu Developers, Linux MM, Stefan Hajnoczi, Paolo Bonzini,
	Nitesh Narayan Lal

On Wed, Oct 18, 2017 at 08:51:37AM -0700, Dan Williams wrote:
> This use case is not "Persistent Memory". Persistent Memory is
> something you can map and make persistent with CPU instructions.
> Anything that requires a driver call is device driver managed "Shared
> Memory".

How is this any different than the existing nvdimm_flush()? If you
really care about the not driver thing it could easily be a write
to a doorbell page or a hypercall, but in the end that's just semantics.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
       [not found]                     ` <20171019080149.GB10089-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2017-10-19 18:21                       ` Dan Williams
       [not found]                         ` <CAPcyv4j=Cdp68C15HddKaErpve2UGRfSTiL6bHiS=3gQybz9pg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2017-10-19 18:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Jan Kara, xiaoguangrong eric, KVM list, Pankaj Gupta,
	Stefan Hajnoczi, David Hildenbrand, ross zwisler,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Qemu Developers, Linux MM,
	Stefan Hajnoczi, linux-nvdimm, Paolo Bonzini, Nitesh Narayan Lal

On Thu, Oct 19, 2017 at 1:01 AM, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Wed, Oct 18, 2017 at 08:51:37AM -0700, Dan Williams wrote:
>> This use case is not "Persistent Memory". Persistent Memory is
>> something you can map and make persistent with CPU instructions.
>> Anything that requires a driver call is device driver managed "Shared
>> Memory".
>
> How is this any different than the existing nvdimm_flush()? If you
> really care about the not driver thing it could easily be a write
> to a doorbell page or a hypercall, but in the end that's just semantics.

The difference is that nvdimm_flush() is not mandatory, and that the
platform will automatically perform the same flush at power-fail.
Applications should be able to assume that if they are using MAP_SYNC
that no other coordination with the kernel or the hypervisor is
necessary.

Advertising this as a generic Persistent Memory range to the guest
means that the guest could theoretically use it with device-dax where
there is no driver or filesystem sync interface. The hypervisor will
be waiting for flush notifications and the guest will just issue cache
flushes and sfence instructions. So, as far as I can see we need to
differentiate this virtio-model from standard "Persistent Memory" to
the guest and remove the possibility of guests/applications making the
wrong assumption.

Non-ODP RDMA in a guest comes to mind...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
       [not found]                         ` <CAPcyv4j=Cdp68C15HddKaErpve2UGRfSTiL6bHiS=3gQybz9pg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-10-20  8:00                           ` Christoph Hellwig
       [not found]                             ` <20171020080049.GA25471-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2017-10-20  8:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Jan Kara, xiaoguangrong eric, KVM list, Pankaj Gupta,
	Stefan Hajnoczi, David Hildenbrand, ross zwisler,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Qemu Developers,
	Christoph Hellwig, Linux MM, Stefan Hajnoczi, linux-nvdimm,
	Paolo Bonzini, Nitesh Narayan Lal

On Thu, Oct 19, 2017 at 11:21:26AM -0700, Dan Williams wrote:
> The difference is that nvdimm_flush() is not mandatory, and that the
> platform will automatically perform the same flush at power-fail.
> Applications should be able to assume that if they are using MAP_SYNC
> that no other coordination with the kernel or the hypervisor is
> necessary.
> 
> Advertising this as a generic Persistent Memory range to the guest
> means that the guest could theoretically use it with device-dax where
> there is no driver or filesystem sync interface. The hypervisor will
> be waiting for flush notifications and the guest will just issue cache
> flushes and sfence instructions. So, as far as I can see we need to
> differentiate this virtio-model from standard "Persistent Memory" to
> the guest and remove the possibility of guests/applications making the
> wrong assumption.

So add a flag that it is not.  We already have the nd_volatile type,
that is special.  For now only in Linux, but I think adding this type
to the spec eventually would be very useful for efficiently exposing
directly mappable device to VM guests.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
       [not found]                             ` <20171020080049.GA25471-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2017-10-20 15:05                               ` Dan Williams
  2017-10-20 16:06                                 ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2017-10-20 15:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Jan Kara, xiaoguangrong eric, KVM list, Pankaj Gupta,
	Stefan Hajnoczi, David Hildenbrand, ross zwisler,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Qemu Developers, Linux MM,
	Stefan Hajnoczi, linux-nvdimm, Paolo Bonzini, Nitesh Narayan Lal

On Fri, Oct 20, 2017 at 1:00 AM, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Thu, Oct 19, 2017 at 11:21:26AM -0700, Dan Williams wrote:
>> The difference is that nvdimm_flush() is not mandatory, and that the
>> platform will automatically perform the same flush at power-fail.
>> Applications should be able to assume that if they are using MAP_SYNC
>> that no other coordination with the kernel or the hypervisor is
>> necessary.
>>
>> Advertising this as a generic Persistent Memory range to the guest
>> means that the guest could theoretically use it with device-dax where
>> there is no driver or filesystem sync interface. The hypervisor will
>> be waiting for flush notifications and the guest will just issue cache
>> flushes and sfence instructions. So, as far as I can see we need to
>> differentiate this virtio-model from standard "Persistent Memory" to
>> the guest and remove the possibility of guests/applications making the
>> wrong assumption.
>
> So add a flag that it is not.  We already have the nd_volatile type,
> that is special.  For now only in Linux, but I think adding this type
> to the spec eventually would be very useful for efficiently exposing
> directly mappable device to VM guests.

Right, that's the same recommendation I gave.

    https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html

...so maybe I'm misunderstanding your concern? It sounds like we're on
the same page.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-20 15:05                               ` Dan Williams
@ 2017-10-20 16:06                                 ` Christoph Hellwig
  2017-10-20 16:11                                   ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2017-10-20 16:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Kevin Wolf, Jan Kara, xiaoguangrong eric,
	KVM list, Pankaj Gupta, Stefan Hajnoczi, David Hildenbrand,
	ross zwisler, linux-kernel, Qemu Developers, Linux MM,
	Stefan Hajnoczi, linux-nvdimm, Paolo Bonzini, Nitesh Narayan Lal

On Fri, Oct 20, 2017 at 08:05:09AM -0700, Dan Williams wrote:
> Right, that's the same recommendation I gave.
> 
>     https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html
> 
> ...so maybe I'm misunderstanding your concern? It sounds like we're on
> the same page.

Yes, the above is exactly what I think we should do it.  And in many
ways virtio seems overkill if we could just have a hypercall or doorbell
page as the queueing infrastructure in virtio shouldn't really be
needed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] KVM: add virtio-pmem driver
  2017-10-20 16:06                                 ` Christoph Hellwig
@ 2017-10-20 16:11                                   ` Dan Williams
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2017-10-20 16:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Jan Kara, xiaoguangrong eric, KVM list, Pankaj Gupta,
	Stefan Hajnoczi, David Hildenbrand, ross zwisler, linux-kernel,
	Qemu Developers, Linux MM, Stefan Hajnoczi, linux-nvdimm,
	Paolo Bonzini, Nitesh Narayan Lal

On Fri, Oct 20, 2017 at 9:06 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Oct 20, 2017 at 08:05:09AM -0700, Dan Williams wrote:
>> Right, that's the same recommendation I gave.
>>
>>     https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html
>>
>> ...so maybe I'm misunderstanding your concern? It sounds like we're on
>> the same page.
>
> Yes, the above is exactly what I think we should do it.  And in many
> ways virtio seems overkill if we could just have a hypercall or doorbell
> page as the queueing infrastructure in virtio shouldn't really be
> needed.

Ah ok, yes, get rid of the virtio-pmem driver and just make
nvdimm_flush() do a hypercall based on region-type flag.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-10-20 16:11 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-12 15:50 [RFC 0/2] KVM "fake DAX" device flushing Pankaj Gupta
2017-10-12 15:50 ` [RFC 1/2] pmem: Move reusable code to base header files Pankaj Gupta
2017-10-12 20:42   ` Dan Williams
2017-10-12 21:27     ` [Qemu-devel] " Pankaj Gupta
     [not found] ` <20171012155027.3277-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-12 15:50   ` [RFC 2/2] KVM: add virtio-pmem driver Pankaj Gupta
2017-10-12 20:51     ` Dan Williams
2017-10-12 21:25       ` Pankaj Gupta
2017-10-12 21:54         ` Dan Williams
     [not found]           ` <CAPcyv4gkri7t+3Unf0sc9AHMnz-v9G_qV_bJppLjUUNAn7drrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-12 22:18             ` Pankaj Gupta
2017-10-12 22:27               ` Rik van Riel
2017-10-12 22:39                 ` Pankaj Gupta
2017-10-12 22:52                 ` Pankaj Gupta
2017-10-12 22:59                   ` Dan Williams
2017-10-12 23:07                     ` Pankaj Gupta
2017-10-13  9:44     ` Stefan Hajnoczi
2017-10-13 10:48       ` Pankaj Gupta
     [not found]         ` <24301306.20068579.1507891695416.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-16 14:47           ` Stefan Hajnoczi
2017-10-16 15:58             ` Dan Williams
2017-10-16 17:04             ` Pankaj Gupta
     [not found]       ` <20171013094431.GA27308-lxVrvc10SDRcolVlb+j0YCZi+YwRKgec@public.gmane.org>
2017-10-13 15:25         ` Dan Williams
     [not found]     ` <20171012155027.3277-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-10-17  7:16       ` Christoph Hellwig
2017-10-17  7:40         ` [Qemu-devel] " Pankaj Gupta
2017-10-17  8:02           ` Christoph Hellwig
2017-10-17  8:30             ` Pankaj Gupta
2017-10-18 13:03               ` Stefan Hajnoczi
2017-10-18 15:51                 ` Dan Williams
     [not found]                   ` <CAPcyv4h6aFkyHhh4R4DTznbSCLf9CuBoszk0Q1gB5EKNcp_SeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-19  8:01                     ` Stefan Hajnoczi
2017-10-19  8:01                   ` Christoph Hellwig
     [not found]                     ` <20171019080149.GB10089-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2017-10-19 18:21                       ` Dan Williams
     [not found]                         ` <CAPcyv4j=Cdp68C15HddKaErpve2UGRfSTiL6bHiS=3gQybz9pg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-10-20  8:00                           ` Christoph Hellwig
     [not found]                             ` <20171020080049.GA25471-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2017-10-20 15:05                               ` Dan Williams
2017-10-20 16:06                                 ` Christoph Hellwig
2017-10-20 16:11                                   ` Dan Williams
2017-10-12 15:50 ` [RFC] QEMU: Add virtio pmem device Pankaj Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).