nvdimm.lists.linux.dev archive mirror
* [PATCH v2 0/2] kvm "fake DAX" device
@ 2018-10-13  5:00 Pankaj Gupta
       [not found] ` <20181013050021.11962-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-10-13  5:00 ` [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver Pankaj Gupta
  0 siblings, 2 replies; 11+ messages in thread
From: Pankaj Gupta @ 2018-10-13  5:00 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w
  Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA, jack-AlSwsSmVLrQ,
	xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w,
	riel-ebMLmSuQjDVBDgjK7y7TUQ, david-H+wXaHxf7aLQT0dZR+AlfA,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	imammedo-H+wXaHxf7aLQT0dZR+AlfA, mst-H+wXaHxf7aLQT0dZR+AlfA,
	stefanha-H+wXaHxf7aLQT0dZR+AlfA, zwisler-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, nilal-H+wXaHxf7aLQT0dZR+AlfA,
	eblake-H+wXaHxf7aLQT0dZR+AlfA

 This patch series implements "fake DAX": fake persistent
 memory (nvdimm) in the guest, which allows the guest page
 cache to be bypassed. It also implements a VIRTIO-based
 asynchronous flush mechanism.

 This patchset shares the guest kernel driver with the
 changes suggested in v1. It has been tested with the
 Qemu-side device emulation for virtio-pmem [4].

 Details of the project idea for the 'fake DAX' flushing
 interface are shared in [2] & [3].

 The implementation is divided into two parts: a new
 virtio-pmem guest driver, and Qemu code changes for the
 new virtio-pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
---------------------------------
   - Reads the persistent memory range from the paravirt device
     and registers it with 'nvdimm_bus'.
   - The 'nvdimm/pmem' driver uses this information to allocate
     a persistent memory region and set up filesystem operations
     on the allocated memory.
   - The virtio-pmem driver implements an asynchronous flushing
     interface to flush from the guest to the host.

2. Qemu virtio-pmem device
---------------------------------
   - Creates a virtio-pmem device and exposes a memory range to
     the KVM guest.
   - On the host side this is file-backed memory which acts as
     persistent memory.
   - The Qemu-side flush uses the aio thread pool APIs and virtio
     for asynchronous handling of multiple guest requests.

   David Hildenbrand (CCed) has also posted a modified version [5]
   of the qemu virtio-pmem code based on the updated Qemu memory
   device API.

 Virtio-pmem error handling:
 ----------------------------------------
  The behaviour of virtio-pmem has been checked for the error
  types below. Suggestions on the expected behaviour for
  handling these errors are welcome.

  - Hardware errors (uncorrectable, recoverable):
  a] virtio-pmem:
    - As per the current logic, if the error page belongs to the
      Qemu process, the host MCE handler isolates (hwpoisons) that
      page and sends SIGBUS. The Qemu SIGBUS handler injects an
      exception into the KVM guest.
    - The KVM guest then isolates the page and sends SIGBUS to the
      guest userspace process which has the page mapped.

  b] Existing implementation in the ACPI pmem driver:
    - Handles such errors with an MCE notifier and creates a list
      of bad blocks. Read/direct-access DAX operations return -EIO
      if the accessed memory page falls in the bad block list.
    - It also starts background scrubbing.
    - Similar functionality can be reused in virtio-pmem with the
      MCE notifier but without scrubbing (no ACPI/ARS). Inputs are
      needed to confirm whether this behaviour is OK or needs any
      change.

Changes from PATCH v1: [1]
- 0-day build test for build dependency on libnvdimm

 Changes suggested by - [Dan Williams]
- Split the driver into two parts: virtio & pmem
- Move queuing of async block requests to the block layer
- Add "sync" parameter to the nvdimm_flush function
- Use an indirect call for nvdimm_flush
- Don't move declarations to a common global header, e.g. nd.h
- nvdimm_flush() returns 0, or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Replace BSD license text with SPDX license text

- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make the spin_lock_irqsave() critical section narrower

Changes from RFC v3:
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush - Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- Fix request leak, correct sizeof req - Stefan
- Move declaration to virtio_pmem.c

Changes from RFC v2:
- Add flush function in the nd_region in place of switching
  on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
  for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan

Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent 
  memory and other operations instead of creating an entirely 
  new block driver.
- Use VIRTIO driver to register memory information with 
  nvdimm_bus and create region_type accordingly. 
- Call VIRTIO flush from existing pmem driver.

Pankaj Gupta (2):
   libnvdimm: nd_region flush callback support
   virtio-pmem: Add virtio-pmem guest driver

[1] https://lkml.org/lkml/2018/8/31/407
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html  
[4] https://lkml.org/lkml/2018/8/31/413
[5] https://marc.info/?l=qemu-devel&m=153555721901824&w=2

 drivers/acpi/nfit/core.c     |    8 ++--
 drivers/nvdimm/claim.c       |   12 ++++--
 drivers/nvdimm/nd.h          |    2 +
 drivers/nvdimm/pmem.c        |   24 +++++++++----
 drivers/nvdimm/region_devs.c |   76 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/libnvdimm.h    |   10 ++++-
 6 files changed, 110 insertions(+), 22 deletions(-)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] libnvdimm: nd_region flush callback support
       [not found] ` <20181013050021.11962-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-10-13  5:00   ` Pankaj Gupta
  2018-10-13  8:31     ` kbuild test robot
       [not found]     ` <20181013050021.11962-2-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-10-13 16:06   ` [PATCH v2 0/2] kvm "fake DAX" device Dan Williams
  1 sibling, 2 replies; 11+ messages in thread
From: Pankaj Gupta @ 2018-10-13  5:00 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w
  Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA, jack-AlSwsSmVLrQ,
	xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w,
	riel-ebMLmSuQjDVBDgjK7y7TUQ, david-H+wXaHxf7aLQT0dZR+AlfA,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	imammedo-H+wXaHxf7aLQT0dZR+AlfA, mst-H+wXaHxf7aLQT0dZR+AlfA,
	stefanha-H+wXaHxf7aLQT0dZR+AlfA, zwisler-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, nilal-H+wXaHxf7aLQT0dZR+AlfA,
	eblake-H+wXaHxf7aLQT0dZR+AlfA

This patch adds functionality to perform a flush from the guest
to the host over VIRTIO. A callback is registered based on the
'nd_region' type: the virtio_pmem driver requires this special
flush function, while for the rest of the region types the
existing flush function is registered. Errors returned by a host
fsync failure are reported to userspace.

This also handles asynchronous flush requests from the block
layer by creating a child bio and chaining it to the parent bio.

Signed-off-by: Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/acpi/nfit/core.c     |  4 ++--
 drivers/nvdimm/claim.c       |  6 ++++--
 drivers/nvdimm/nd.h          |  1 +
 drivers/nvdimm/pmem.c        | 12 ++++++++----
 drivers/nvdimm/region_devs.c | 38 ++++++++++++++++++++++++++++++++++++--
 include/linux/libnvdimm.h    |  5 ++++-
 6 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index b072cfc..f154852 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2234,7 +2234,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 		offset = to_interleave_offset(offset, mmio);
 
 	writeq(cmd, mmio->addr.base + offset);
-	nvdimm_flush(nfit_blk->nd_region);
+	nvdimm_flush(nfit_blk->nd_region, NULL, false);
 
 	if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
 		readq(mmio->addr.base + offset);
@@ -2283,7 +2283,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 	}
 
 	if (rw)
-		nvdimm_flush(nfit_blk->nd_region);
+		nvdimm_flush(nfit_blk->nd_region, NULL, false);
 
 	rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
 	return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf..a1dfa06 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
 	sector_t sector = offset >> 9;
-	int rc = 0;
+	int rc = 0, ret = 0;
 
 	if (unlikely(!size))
 		return 0;
@@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	}
 
 	memcpy_flushcache(nsio->addr + offset, buf, size);
-	nvdimm_flush(to_nd_region(ndns->dev.parent));
+	ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL, false);
+	if (ret)
+		rc = ret;
 
 	return rc;
 }
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 98317e7..d53a2d1 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -160,6 +160,7 @@ struct nd_region {
 	struct nd_interleave_set *nd_set;
 	struct nd_percpu_lane __percpu *lane;
+	int (*flush)(struct nd_region *nd_region);
 	struct nd_mapping mapping[0];
 };
 
 struct nd_blk_region {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 6071e29..5d6a4a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 
 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 {
+	int ret = 0;
 	blk_status_t rc = 0;
 	bool do_acct;
 	unsigned long start;
@@ -201,7 +202,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct nd_region *nd_region = to_region(pmem);
 
 	if (bio->bi_opf & REQ_PREFLUSH)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio, true);
 
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 		nd_iostat_end(bio, start);
 
 	if (bio->bi_opf & REQ_FUA)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio, true);
+
+	if (ret)
+		bio->bi_status = errno_to_blk_status(ret);
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
@@ -528,14 +532,14 @@ static int nd_pmem_remove(struct device *dev)
 		sysfs_put(pmem->bb_state);
 		pmem->bb_state = NULL;
 	}
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL, false);
 
 	return 0;
 }
 
 static void nd_pmem_shutdown(struct device *dev)
 {
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL, false);
 }
 
 static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fa37afc..5508727 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -290,7 +290,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
 		return rc;
 	if (!flush)
 		return -EINVAL;
-	nvdimm_flush(nd_region);
+	rc = nvdimm_flush(nd_region, NULL, false);
+	if (rc)
+		return rc;
 
 	return len;
 }
@@ -1065,6 +1067,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	dev->of_node = ndr_desc->of_node;
 	nd_region->ndr_size = resource_size(ndr_desc->res);
 	nd_region->ndr_start = ndr_desc->res->start;
+	if (ndr_desc->flush)
+		nd_region->flush = ndr_desc->flush;
+	else
+		nd_region->flush = generic_nvdimm_flush;
+
 	nd_device_register(dev);
 
 	return nd_region;
@@ -1105,11 +1112,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 }
 EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
 
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
+{
+	int rc = 0;
+
+	/* Create child bio for asynchronous flush and chain with
+	 * parent bio. Otherwise directly call nd_region flush.
+	 */
+	if (async && bio->bi_iter.bi_sector != -1) {
+
+		struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+		if (!child)
+			return -ENOMEM;
+		bio_copy_dev(child, bio);
+		child->bi_opf = REQ_PREFLUSH;
+		child->bi_iter.bi_sector = -1;
+		bio_chain(child, bio);
+		submit_bio(child);
+	} else {
+		if (nd_region->flush(nd_region))
+			rc = -EIO;
+	}
+
+	return rc;
+}
 /**
  * nvdimm_flush - flush any posted write queues between the cpu and pmem media
  * @nd_region: blk or interleaved pmem region
  */
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
 {
 	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i, idx;
@@ -1133,6 +1165,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 		if (ndrd_get_flush_wpq(ndrd, i, 0))
 			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
 	wmb();
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(nvdimm_flush);
 
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 097072c..b49632c 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -115,6 +115,7 @@ struct nd_mapping_desc {
 	int position;
 };
 
+struct nd_region;
 struct nd_region_desc {
 	struct resource *res;
 	struct nd_mapping_desc *mapping;
@@ -126,6 +127,7 @@ struct nd_region_desc {
 	int numa_node;
 	unsigned long flags;
 	struct device_node *of_node;
+	int (*flush)(struct nd_region *nd_region);
 };
 
 struct device;
@@ -201,7 +203,8 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
 unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
 void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
+int generic_nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
 int nvdimm_has_cache(struct nd_region *nd_region);
 
-- 
2.9.3


* [PATCH v2 2/2]  virtio-pmem: Add virtio pmem driver
  2018-10-13  5:00 [PATCH v2 0/2] kvm "fake DAX" device Pankaj Gupta
       [not found] ` <20181013050021.11962-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-10-13  5:00 ` Pankaj Gupta
       [not found]   ` <20181013050021.11962-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Pankaj Gupta @ 2018-10-13  5:00 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm
  Cc: jack, stefanha, dan.j.williams, riel, nilal, kwolf, pbonzini,
	zwisler, vishal.l.verma, dave.jiang, david, xiaoguangrong.eric,
	hch, mst, lcapitulino, imammedo, eblake, pagupta

This patch adds a virtio-pmem driver for KVM guests.

The guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates an nd_region object with the persistent memory
range information so that the existing 'nvdimm/pmem' driver
can reserve this in the system memory map. This way the
'virtio-pmem' driver uses the existing functionality of the
pmem driver to register persistent memory compatible with
DAX-capable filesystems.

This also provides a function to perform a guest flush over
VIRTIO from the 'pmem' driver when userspace performs a flush
on a DAX memory range.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/nvdimm/virtio_pmem.c     |  84 ++++++++++++++++++++++++++
 drivers/virtio/Kconfig           |  10 ++++
 drivers/virtio/Makefile          |   1 +
 drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_pmem.h      |  60 +++++++++++++++++++
 include/uapi/linux/virtio_ids.h  |   1 +
 include/uapi/linux/virtio_pmem.h |  10 ++++
 7 files changed, 290 insertions(+)
 create mode 100644 drivers/nvdimm/virtio_pmem.c
 create mode 100644 drivers/virtio/pmem.c
 create mode 100644 include/linux/virtio_pmem.h
 create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 0000000..2a1b1ba
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+	unsigned int len;
+	unsigned long flags;
+	struct virtio_pmem_request *req, *req_buf;
+	struct virtio_pmem *vpmem = vq->vdev->priv;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+		req->done = true;
+		wake_up(&req->host_acked);
+
+		if (!list_empty(&vpmem->req_list)) {
+			req_buf = list_first_entry(&vpmem->req_list,
+					struct virtio_pmem_request, list);
+			list_del(&req_buf->list);
+			req_buf->wq_buf_avail = true;
+			wake_up(&req_buf->wq_buf);
+		}
+	}
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+	int err;
+	unsigned long flags;
+	struct scatterlist *sgs[2], sg, ret;
+	struct virtio_device *vdev = nd_region->provider_data;
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct virtio_pmem_request *req;
+
+	might_sleep();
+	req = kmalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->done = req->wq_buf_avail = false;
+	strcpy(req->name, "FLUSH");
+	init_waitqueue_head(&req->host_acked);
+	init_waitqueue_head(&req->wq_buf);
+	sg_init_one(&sg, req->name, strlen(req->name));
+	sgs[0] = &sg;
+	sg_init_one(&ret, &req->ret, sizeof(req->ret));
+	sgs[1] = &ret;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+	if (err) {
+		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+		list_add_tail(&req->list, &vpmem->req_list);
+		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+		/* When host has read buffer, this completes via host_ack */
+		wait_event(req->wq_buf, req->wq_buf_avail);
+		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	}
+	virtqueue_kick(vpmem->req_vq);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	/* When host has read buffer, this completes via host_ack */
+	wait_event(req->host_acked, req->done);
+	err = req->ret;
+	kfree(req);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3589764..9f634a2 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
 
 	  If unsure, say Y.
 
+config VIRTIO_PMEM
+	tristate "Support for virtio pmem driver"
+	depends on VIRTIO
+	depends on LIBNVDIMM
+	help
+	  This driver provides support for a virtio-based flushing
+	  interface for persistent memory ranges.
+
+	  If unsure, say M.
+
 config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5..143ce91 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
new file mode 100644
index 0000000..51f5349
--- /dev/null
+++ b/drivers/virtio/pmem.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */
+#include <linux/virtio_pmem.h>
+#include <../../drivers/nvdimm/nd.h>
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+	struct virtqueue *vq;
+
+	/* single vq */
+	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+				host_ack, "flush_queue");
+	if (IS_ERR(vq))
+		return PTR_ERR(vq);
+
+	spin_lock_init(&vpmem->pmem_lock);
+	INIT_LIST_HEAD(&vpmem->req_list);
+
+	return 0;
+}
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+	int err = 0;
+	struct resource res;
+	struct virtio_pmem *vpmem;
+	struct nvdimm_bus *nvdimm_bus;
+	struct nd_region_desc ndr_desc;
+	int nid = dev_to_node(&vdev->dev);
+	struct nd_region *nd_region;
+
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+			GFP_KERNEL);
+	if (!vpmem) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	vpmem->vdev = vdev;
+	err = init_vq(vpmem);
+	if (err)
+		goto out_err;
+
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			start, &vpmem->start);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			size, &vpmem->size);
+
+	res.start = vpmem->start;
+	res.end   = vpmem->start + vpmem->size-1;
+	vpmem->nd_desc.provider_name = "virtio-pmem";
+	vpmem->nd_desc.module = THIS_MODULE;
+
+	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+						&vpmem->nd_desc);
+	if (!nvdimm_bus) {
+		err = -ENXIO;
+		goto out_vq;
+	}
+
+	dev_set_drvdata(&vdev->dev, nvdimm_bus);
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+
+	ndr_desc.res = &res;
+	ndr_desc.numa_node = nid;
+	ndr_desc.flush = virtio_pmem_flush;
+	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+	if (!nd_region)
+		goto out_nd;
+
+	nd_region->provider_data =
+		dev_to_virtio(nd_region->dev.parent->parent);
+
+	//virtio_device_ready(vdev);
+	return 0;
+out_nd:
+	err = -ENXIO;
+	nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+	vdev->config->del_vqs(vdev);
+out_err:
+	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+	return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	vdev->config->del_vqs(vdev);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtio_pmem_probe,
+	.remove			= virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
new file mode 100644
index 0000000..224f9d9
--- /dev/null
+++ b/include/linux/virtio_pmem.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * virtio_pmem.h: virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ **/
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/virtio_ids.h>
+#include <linux/module.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+#include <linux/spinlock.h>
+
+struct virtio_pmem_request {
+	/* Host return status corresponding to flush request */
+	int ret;
+
+	/* command name*/
+	char name[16];
+
+	/* Wait queue to process deferred work after ack from host */
+	wait_queue_head_t host_acked;
+	bool done;
+
+	/* Wait queue to process deferred work after virt queue buffer avail */
+	wait_queue_head_t wq_buf;
+	bool wq_buf_avail;
+	struct list_head list;
+};
+
+struct virtio_pmem {
+	struct virtio_device *vdev;
+
+	/* Virtio pmem request queue */
+	struct virtqueue *req_vq;
+
+	/* nvdimm bus registers virtio pmem device */
+	struct nvdimm_bus *nvdimm_bus;
+	struct nvdimm_bus_descriptor nd_desc;
+
+	/* List to store deferred work if virtqueue is full */
+	struct list_head req_list;
+
+	/* Synchronize virtqueue data */
+	spinlock_t pmem_lock;
+
+	/* Memory region information */
+	uint64_t start;
+	uint64_t size;
+};
+
+void host_ack(struct virtqueue *vq);
+int virtio_pmem_flush(struct nd_region *nd_region);
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2..3463895 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         25 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 0000000..fa3f7d5
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+	__le64 start;
+	__le64 size;
+};
+#endif
-- 
2.9.3


* Re: [PATCH v2 1/2] libnvdimm: nd_region flush callback support
  2018-10-13  5:00   ` [PATCH v2 1/2] libnvdimm: nd_region flush callback support Pankaj Gupta
@ 2018-10-13  8:31     ` kbuild test robot
       [not found]     ` <20181013050021.11962-2-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2018-10-13  8:31 UTC (permalink / raw)
  Cc: kbuild-all, linux-kernel, kvm, qemu-devel, linux-nvdimm, jack,
	stefanha, dan.j.williams, riel, nilal, kwolf, pbonzini, zwisler,
	vishal.l.verma, dave.jiang, david, xiaoguangrong.eric, hch, mst,
	lcapitulino, imammedo, eblake, pagupta

[-- Attachment #1: Type: text/plain, Size: 4472 bytes --]

Hi Pankaj,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux-nvdimm/libnvdimm-for-next]
[also build test WARNING on v4.19-rc7 next-20181012]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pankaj-Gupta/libnvdimm-nd_region-flush-callback-support/20181013-152624
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: x86_64-randconfig-x017-201840 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from drivers/nvdimm/bus.c:14:0:
>> include/linux/libnvdimm.h:206:54: warning: 'struct bio' declared inside parameter list will not be visible outside of this definition or declaration
    int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
                                                         ^~~

vim +206 include/linux/libnvdimm.h

   159	
   160	void badrange_init(struct badrange *badrange);
   161	int badrange_add(struct badrange *badrange, u64 addr, u64 length);
   162	void badrange_forget(struct badrange *badrange, phys_addr_t start,
   163			unsigned int len);
   164	int nvdimm_bus_add_badrange(struct nvdimm_bus *nvdimm_bus, u64 addr,
   165			u64 length);
   166	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
   167			struct nvdimm_bus_descriptor *nfit_desc);
   168	void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus);
   169	struct nvdimm_bus *to_nvdimm_bus(struct device *dev);
   170	struct nvdimm *to_nvdimm(struct device *dev);
   171	struct nd_region *to_nd_region(struct device *dev);
   172	struct device *nd_region_dev(struct nd_region *nd_region);
   173	struct nd_blk_region *to_nd_blk_region(struct device *dev);
   174	struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
   175	struct device *to_nvdimm_bus_dev(struct nvdimm_bus *nvdimm_bus);
   176	const char *nvdimm_name(struct nvdimm *nvdimm);
   177	struct kobject *nvdimm_kobj(struct nvdimm *nvdimm);
   178	unsigned long nvdimm_cmd_mask(struct nvdimm *nvdimm);
   179	void *nvdimm_provider_data(struct nvdimm *nvdimm);
   180	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
   181			const struct attribute_group **groups, unsigned long flags,
   182			unsigned long cmd_mask, int num_flush,
   183			struct resource *flush_wpq);
   184	const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd);
   185	const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd);
   186	u32 nd_cmd_in_size(struct nvdimm *nvdimm, int cmd,
   187			const struct nd_cmd_desc *desc, int idx, void *buf);
   188	u32 nd_cmd_out_size(struct nvdimm *nvdimm, int cmd,
   189			const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
   190			const u32 *out_field, unsigned long remainder);
   191	int nvdimm_bus_check_dimm_count(struct nvdimm_bus *nvdimm_bus, int dimm_count);
   192	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
   193			struct nd_region_desc *ndr_desc);
   194	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
   195			struct nd_region_desc *ndr_desc);
   196	struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
   197			struct nd_region_desc *ndr_desc);
   198	void *nd_region_provider_data(struct nd_region *nd_region);
   199	void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
   200	void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
   201	struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
   202	unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
   203	unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
   204	void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
   205	u64 nd_fletcher64(void *addr, size_t len, bool le);
 > 206	int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
   207	int generic_nvdimm_flush(struct nd_region *nd_region);
   208	int nvdimm_has_flush(struct nd_region *nd_region);
   209	int nvdimm_has_cache(struct nd_region *nd_region);
   210	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33326 bytes --]


* Re: [PATCH v2 1/2] libnvdimm: nd_region flush callback support
       [not found]     ` <20181013050021.11962-2-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-10-13  9:38       ` kbuild test robot
  0 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2018-10-13  9:38 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: jack-AlSwsSmVLrQ, kvm-u79uwXL29TY76Z2rM5mHXA,
	david-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA,
	zwisler-DgEjT+Ai2ygdnm+yROfE0A, eblake-H+wXaHxf7aLQT0dZR+AlfA,
	mst-H+wXaHxf7aLQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	nilal-H+wXaHxf7aLQT0dZR+AlfA, riel-ebMLmSuQjDVBDgjK7y7TUQ,
	stefanha-H+wXaHxf7aLQT0dZR+AlfA, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	kwolf-H+wXaHxf7aLQT0dZR+AlfA,
	xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kbuild-all-JC7UmRfGjtg,
	imammedo-H+wXaHxf7aLQT0dZR+AlfA

Hi Pankaj,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux-nvdimm/libnvdimm-for-next]
[also build test WARNING on v4.19-rc7 next-20181012]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pankaj-Gupta/libnvdimm-nd_region-flush-callback-support/20181013-152624
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: x86_64-randconfig-g0-10131621 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from drivers//acpi/nfit/core.c:14:0:
>> include/linux/libnvdimm.h:206:54: warning: 'struct bio' declared inside parameter list
    int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
                                                         ^
>> include/linux/libnvdimm.h:206:54: warning: its scope is only this definition or declaration, which is probably not what you want

vim +206 include/linux/libnvdimm.h

   159	
   160	void badrange_init(struct badrange *badrange);
   161	int badrange_add(struct badrange *badrange, u64 addr, u64 length);
   162	void badrange_forget(struct badrange *badrange, phys_addr_t start,
   163			unsigned int len);
   164	int nvdimm_bus_add_badrange(struct nvdimm_bus *nvdimm_bus, u64 addr,
   165			u64 length);
   166	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
   167			struct nvdimm_bus_descriptor *nfit_desc);
   168	void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus);
   169	struct nvdimm_bus *to_nvdimm_bus(struct device *dev);
   170	struct nvdimm *to_nvdimm(struct device *dev);
   171	struct nd_region *to_nd_region(struct device *dev);
   172	struct device *nd_region_dev(struct nd_region *nd_region);
   173	struct nd_blk_region *to_nd_blk_region(struct device *dev);
   174	struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
   175	struct device *to_nvdimm_bus_dev(struct nvdimm_bus *nvdimm_bus);
   176	const char *nvdimm_name(struct nvdimm *nvdimm);
   177	struct kobject *nvdimm_kobj(struct nvdimm *nvdimm);
   178	unsigned long nvdimm_cmd_mask(struct nvdimm *nvdimm);
   179	void *nvdimm_provider_data(struct nvdimm *nvdimm);
   180	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
   181			const struct attribute_group **groups, unsigned long flags,
   182			unsigned long cmd_mask, int num_flush,
   183			struct resource *flush_wpq);
   184	const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd);
   185	const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd);
   186	u32 nd_cmd_in_size(struct nvdimm *nvdimm, int cmd,
   187			const struct nd_cmd_desc *desc, int idx, void *buf);
   188	u32 nd_cmd_out_size(struct nvdimm *nvdimm, int cmd,
   189			const struct nd_cmd_desc *desc, int idx, const u32 *in_field,
   190			const u32 *out_field, unsigned long remainder);
   191	int nvdimm_bus_check_dimm_count(struct nvdimm_bus *nvdimm_bus, int dimm_count);
   192	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
   193			struct nd_region_desc *ndr_desc);
   194	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
   195			struct nd_region_desc *ndr_desc);
   196	struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
   197			struct nd_region_desc *ndr_desc);
   198	void *nd_region_provider_data(struct nd_region *nd_region);
   199	void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
   200	void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
   201	struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
   202	unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
   203	unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
   204	void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
   205	u64 nd_fletcher64(void *addr, size_t len, bool le);
 > 206	int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
   207	int generic_nvdimm_flush(struct nd_region *nd_region);
   208	int nvdimm_has_flush(struct nd_region *nd_region);
   209	int nvdimm_has_cache(struct nd_region *nd_region);
   210	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] kvm "fake DAX" device
       [not found] ` <20181013050021.11962-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-10-13  5:00   ` [PATCH v2 1/2] libnvdimm: nd_region flush callback support Pankaj Gupta
@ 2018-10-13 16:06   ` Dan Williams
  2018-10-13 16:29     ` [Qemu-devel] " Pankaj Gupta
  1 sibling, 1 reply; 11+ messages in thread
From: Dan Williams @ 2018-10-13 16:06 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, KVM list, Rik van Riel,
	linux-nvdimm, David Hildenbrand, Linux Kernel Mailing List,
	Qemu Developers, Christoph Hellwig, Igor Mammedov,
	Michael S. Tsirkin, Stefan Hajnoczi,
	zwisler-DgEjT+Ai2ygdnm+yROfE0A,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, Paolo Bonzini,
	Nitesh Narayan Lal, Eric Blake

On Fri, Oct 12, 2018 at 10:00 PM Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>  This patch series has implementation for "fake DAX".
>  "fake DAX" is fake persistent memory(nvdimm) in guest
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.

Can we stop calling this 'fake DAX', because it isn't 'DAX' and it's
starting to confuse people. This enabling is effectively a
host-page-cache-passthrough mechanism not DAX. Let's call the whole
approach virtio-pmem, and leave DAX out of the name to hopefully
prevent people from wondering why some DAX features are disabled with
this driver. For example MAP_SYNC is not compatible with this
approach.

Additional enabling is needed to disable MAP_SYNC in the presence of a
virtio-pmem device. See the rough proposal here:

    https://lkml.org/lkml/2018/4/25/756

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver
       [not found]   ` <20181013050021.11962-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-10-13 16:10     ` Dan Williams
       [not found]       ` <CAPcyv4hroaVWA-HgjWCnr7QTd_y7U8sCvS+Up733ttnD6_cKzQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2018-10-13 16:10 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, KVM list, Rik van Riel,
	linux-nvdimm, David Hildenbrand, Linux Kernel Mailing List,
	Qemu Developers, Christoph Hellwig, Igor Mammedov,
	Michael S. Tsirkin, Stefan Hajnoczi,
	zwisler-DgEjT+Ai2ygdnm+yROfE0A,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, Paolo Bonzini,
	Nitesh Narayan Lal, Eric Blake

On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.

Before we can move forward with this driver we need additional
filesystem enabling to detect when the backing device is fronting DAX
pmem or a paravirtualized page cache through virtio-pmem. Any
interface that requires fsync() and a round trip to the hypervisor to
flush host page cache is not DAX.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/2] kvm "fake DAX" device
  2018-10-13 16:06   ` [PATCH v2 0/2] kvm "fake DAX" device Dan Williams
@ 2018-10-13 16:29     ` Pankaj Gupta
  0 siblings, 0 replies; 11+ messages in thread
From: Pankaj Gupta @ 2018-10-13 16:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, KVM list, Rik van Riel,
	linux-nvdimm, David Hildenbrand, Linux Kernel Mailing List,
	Dave Jiang, Qemu Developers, Christoph Hellwig, Vishal L Verma,
	Igor Mammedov, Michael S. Tsirkin, Stefan Hajnoczi, zwisler,
	lcapitulino, Paolo Bonzini, Nitesh Narayan Lal


> >
> >  This patch series has implementation for "fake DAX".
> >  "fake DAX" is fake persistent memory(nvdimm) in guest
> >  which allows to bypass the guest page cache. This also
> >  implements a VIRTIO based asynchronous flush mechanism.
> 
> Can we stop calling this 'fake DAX', because it isn't 'DAX' and it's
> starting to confuse people. This enabling is effectively a
> host-page-cache-passthrough mechanism not DAX. Let's call the whole
> approach virtio-pmem, and leave DAX out of the name to hopefully
> prevent people from wondering why some DAX features are disabled with
> this driver. For example MAP_SYNC is not compatible with this
> approach.

Sure, I got your point. I will use "virtio-pmem" in the future.

> 
> Additional enabling is needed to disable MAP_SYNC in the presence of a
> virtio-pmem device. See the rough proposal here:
> 
>     https://lkml.org/lkml/2018/4/25/756

Yes, I will handle disabling of MAP_SYNC for this use-case.


Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver
       [not found]       ` <CAPcyv4hroaVWA-HgjWCnr7QTd_y7U8sCvS+Up733ttnD6_cKzQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-17 19:11         ` Pankaj Gupta
  2018-10-17 19:36           ` Dan Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Pankaj Gupta @ 2018-10-17 19:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Nitesh Narayan Lal, Jan Kara, Xiao Guangrong,
	KVM list, Rik van Riel, linux-nvdimm, David Hildenbrand,
	Linux Kernel Mailing List, Qemu Developers, Christoph Hellwig,
	Igor Mammedov, Paolo Bonzini, Michael S. Tsirkin,
	Stefan Hajnoczi, zwisler-DgEjT+Ai2ygdnm+yROfE0A,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA



> On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> 
> Before we can move forward with this driver we need additional
> filesystem enabling to detect when the backing device is fronting DAX
> pmem or a paravirtualized page cache through virtio-pmem. Any
> interface that requires fsync() and a round trip to the hypervisor to
> flush host page cache is not DAX.

I saw your proposal[1] for the new mmap flag MAP_DIRECT. IIUC, a mapping should
fail for MAP_DIRECT if it requires an explicit flush or buffer indirection. So,
if we disable the MAP_SYNC flag for virtio-pmem, this should fail MAP_DIRECT as
well? Otherwise, without MAP_DIRECT, virtio-pmem should default to the VIRTIO
flush mechanism.

[1] https://marc.info/?l=linux-fsdevel&m=153953206330814&w=2

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver
  2018-10-17 19:11         ` [Qemu-devel] " Pankaj Gupta
@ 2018-10-17 19:36           ` Dan Williams
  2018-10-18  1:44             ` Pankaj Gupta
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2018-10-17 19:36 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Kevin Wolf, Jan Kara, Xiao Guangrong, KVM list, Rik van Riel,
	linux-nvdimm, David Hildenbrand, Linux Kernel Mailing List,
	Dave Jiang, Qemu Developers, Christoph Hellwig, Vishal L Verma,
	Igor Mammedov, Michael S. Tsirkin, Stefan Hajnoczi, zwisler,
	lcapitulino, Paolo Bonzini, Nitesh Narayan Lal

On Wed, Oct 17, 2018 at 12:11 PM Pankaj Gupta <pagupta@redhat.com> wrote:
>
>
>
> > On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta <pagupta@redhat.com> wrote:
> > >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> >
> > Before we can move forward with this driver we need additional
> > filesystem enabling to detect when the backing device is fronting DAX
> > pmem or a paravirtualized page cache through virtio-pmem. Any
> > interface that requires fsync() and a round trip to the hypervisor to
> > flush host page cache is not DAX.
>
> I saw your proposal[1] for the new mmap flag MAP_DIRECT. IIUC, a mapping should
> fail for MAP_DIRECT if it requires an explicit flush or buffer indirection. So,
> if we disable the MAP_SYNC flag for virtio-pmem, this should fail MAP_DIRECT as
> well? Otherwise, without MAP_DIRECT, virtio-pmem should default to the VIRTIO
> flush mechanism.

Right, although I wouldn't worry about MAP_DIRECT in the short term
since we're still discussing what the upstream interface should be.
Regardless of whether MAP_DIRECT is specified or not, the virtio-flush
mechanism would always be used for virtio-pmem. I.e. there is no way
to get full DAX operation with virtio-pmem, only the page-cache-bypass
subset.

Taking a look at where we could inject this check for filesystems it's
a bit awkward to do it in xfs_file_mmap() for example because we do
not have the backing device for the extents of the inode. So at a
minimum you would need to investigate calling xfs_inode_supports_dax()
from that path and teaching it about a new dax_device flag. I'm
thinking the dax_device flag should be called DAXDEV_BUFFERED to
indicate the presence of software buffering on a device that otherwise
supports bypassing the local page cache.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver
  2018-10-17 19:36           ` Dan Williams
@ 2018-10-18  1:44             ` Pankaj Gupta
  0 siblings, 0 replies; 11+ messages in thread
From: Pankaj Gupta @ 2018-10-18  1:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kevin Wolf, Nitesh Narayan Lal, Jan Kara, Xiao Guangrong,
	KVM list, Rik van Riel, linux-nvdimm, David Hildenbrand,
	Linux Kernel Mailing List, Qemu Developers, Christoph Hellwig,
	Igor Mammedov, Paolo Bonzini, Michael S. Tsirkin,
	Stefan Hajnoczi, Vishal L Verma, zwisler, Dave Jiang,
	lcapitulino


> >
> >
> > > On Fri, Oct 12, 2018 at 10:01 PM Pankaj Gupta <pagupta@redhat.com> wrote:
> > > >
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > >
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > >
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > >
> > > Before we can move forward with this driver we need additional
> > > filesystem enabling to detect when the backing device is fronting DAX
> > > pmem or a paravirtualized page cache through virtio-pmem. Any
> > > interface that requires fsync() and a round trip to the hypervisor to
> > > flush host page cache is not DAX.
> >
> > I saw your proposal[1] for the new mmap flag MAP_DIRECT. IIUC, a mapping
> > should fail for MAP_DIRECT if it requires an explicit flush or buffer
> > indirection. So, if we disable the MAP_SYNC flag for virtio-pmem, this
> > should fail MAP_DIRECT as well? Otherwise, without MAP_DIRECT, virtio-pmem
> > should default to the VIRTIO flush mechanism.
> 
> Right, although I wouldn't worry about MAP_DIRECT in the short term
> since we're still discussing what the upstream interface should be.
> Regardless of whether MAP_DIRECT is specified or not, the virtio-flush
> mechanism would always be used for virtio-pmem. I.e. there is no way
> to get full DAX operation with virtio-pmem, only the page-cache-bypass
> subset.

Agree. I will also follow the thread.

> 
> Taking a look at where we could inject this check for filesystems it's
> a bit awkward to do it in xfs_file_mmap() for example because we do
> not have the backing device for the extents of the inode. So at a
> minimum you would need to investigate calling xfs_inode_supports_dax()
> from that path and teaching it about a new dax_device flag. I'm
> thinking the dax_device flag should be called DAXDEV_BUFFERED to
> indicate the presence of software buffering on a device that otherwise
> supports bypassing the local page cache.

Sure, I will investigate the XFS code as suggested.
Thanks for the detailed directions toward the solution; I will try to come up
with one.

Best regards,
Pankaj
  

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-10-18  1:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-13  5:00 [PATCH v2 0/2] kvm "fake DAX" device Pankaj Gupta
     [not found] ` <20181013050021.11962-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-10-13  5:00   ` [PATCH v2 1/2] libnvdimm: nd_region flush callback support Pankaj Gupta
2018-10-13  8:31     ` kbuild test robot
     [not found]     ` <20181013050021.11962-2-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-10-13  9:38       ` kbuild test robot
2018-10-13 16:06   ` [PATCH v2 0/2] kvm "fake DAX" device Dan Williams
2018-10-13 16:29     ` [Qemu-devel] " Pankaj Gupta
2018-10-13  5:00 ` [PATCH v2 2/2] virtio-pmem: Add virtio pmem driver Pankaj Gupta
     [not found]   ` <20181013050021.11962-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-10-13 16:10     ` Dan Williams
     [not found]       ` <CAPcyv4hroaVWA-HgjWCnr7QTd_y7U8sCvS+Up733ttnD6_cKzQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-17 19:11         ` [Qemu-devel] " Pankaj Gupta
2018-10-17 19:36           ` Dan Williams
2018-10-18  1:44             ` Pankaj Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).