linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/5] kvm "virtio pmem" device
@ 2019-01-09 13:50 Pankaj Gupta
  2019-01-09 13:50 ` [PATCH v3 1/5] libnvdimm: nd_region flush callback support Pankaj Gupta
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-09 13:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-fsdevel,
	virtualization, linux-acpi, linux-ext4, linux-xfs
  Cc: jack, stefanha, dan.j.williams, riel, nilal, kwolf, pbonzini,
	zwisler, vishal.l.verma, dave.jiang, david, jmoyer,
	xiaoguangrong.eric, hch, mst, jasowang, lcapitulino, imammedo,
	eblake, willy, tytso, adilger.kernel, darrick.wong, rjw, pagupta

 This patch series has implementation for "virtio pmem". 
 "virtio pmem" is fake persistent memory(nvdimm) in guest 
 which allows to bypass the guest page cache. This also
 implements a VIRTIO based asynchronous flush mechanism.  
 
 Sharing guest kernel driver in this patchset with the 
 changes suggested in v2. Tested with Qemu side device 
 emulation for virtio-pmem [6]. 
 
 Details of project idea for 'virtio pmem' flushing interface 
 is shared [3] & [4].

 Implementation is divided into two parts:
 New virtio pmem guest driver and qemu code changes for new 
 virtio pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
---------------------------------
   - Reads persistent memory range from paravirt device and 
     registers with 'nvdimm_bus'.  
   - 'nvdimm/pmem' driver uses this information to allocate 
     persistent memory region and setup filesystem operations 
     to the allocated memory. 
   - virtio pmem driver implements asynchronous flushing 
     interface to flush from guest to host.

2. Qemu virtio-pmem device
---------------------------------
   - Creates virtio pmem device and exposes a memory range to 
     KVM guest. 
   - At host side this is file backed memory which acts as 
     persistent memory. 
   - Qemu side flush uses aio thread pool API's and virtio 
     for asynchronous guest multi request handling. 

   David Hildenbrand CCed also posted a modified version[6] of 
   qemu virtio-pmem code based on updated Qemu memory device API. 

 Virtio-pmem errors handling:
 ----------------------------------------
  Checked behaviour of virtio-pmem for below types of errors
  Need suggestions on expected behaviour for handling these errors?

  - Hardware Errors: Uncorrectable recoverable Errors: 
  a] virtio-pmem: 
    - As per current logic if error page belongs to Qemu process, 
      host MCE handler isolates(hwpoison) that page and send SIGBUS. 
      Qemu SIGBUS handler injects exception to KVM guest. 
    - KVM guest then isolates the page and send SIGBUS to guest 
      userspace process which has mapped the page. 
  
  b] Existing implementation for ACPI pmem driver: 
    - Handles such errors with MCE notifier and creates a list 
      of bad blocks. Read/direct access DAX operation return EIO 
      if accessed memory page fall in bad block list.
    - It also starts backgound scrubbing.  
    - Similar functionality can be reused in virtio-pmem with MCE 
      notifier but without scrubbing(no ACPI/ARS)? Need inputs to 
      confirm if this behaviour is ok or needs any change?

Changes from PATCH v2: [1]
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan] 
- Use name 'virtio pmem' in place of 'fake dax' 

Changes from PATCH v1: [2]
- 0-day build test for build dependency on libnvdimm 

 Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem  
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Remove BSD license text with SPDX license text

- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow

Changes from RFC v3
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush- Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak, correct sizeof req- Stefan
- Move declaration to virtio_pmem.c

Changes from RFC v2:
- Add flush function in the nd_region in place of switching
  on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
  for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan

Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent 
  memory and other operations instead of creating an entirely 
  new block driver.
- Use VIRTIO driver to register memory information with 
  nvdimm_bus and create region_type accordingly. 
- Call VIRTIO flush from existing pmem driver.

Pankaj Gupta (5):
   libnvdimm: nd_region flush callback support
   virtio-pmem: Add virtio-pmem guest driver
   libnvdimm: add nd_region buffered dax_dev flag
   ext4: disable map_sync for virtio pmem
   xfs: disable map_sync for virtio pmem

[2] https://lkml.org/lkml/2018/8/31/407
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html  
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=qemu-devel&m=153555721901824&w=2

 drivers/acpi/nfit/core.c         |    4 -
 drivers/dax/super.c              |   17 +++++
 drivers/nvdimm/claim.c           |    6 +
 drivers/nvdimm/nd.h              |    1 
 drivers/nvdimm/pmem.c            |   15 +++-
 drivers/nvdimm/region_devs.c     |   45 +++++++++++++-
 drivers/nvdimm/virtio_pmem.c     |   84 ++++++++++++++++++++++++++
 drivers/virtio/Kconfig           |   10 +++
 drivers/virtio/Makefile          |    1 
 drivers/virtio/pmem.c            |  125 +++++++++++++++++++++++++++++++++++++++
 fs/ext4/file.c                   |   11 +++
 fs/xfs/xfs_file.c                |    8 ++
 include/linux/dax.h              |    9 ++
 include/linux/libnvdimm.h        |   11 +++
 include/linux/virtio_pmem.h      |   60 ++++++++++++++++++
 include/uapi/linux/virtio_ids.h  |    1 
 include/uapi/linux/virtio_pmem.h |   10 +++
 17 files changed, 406 insertions(+), 12 deletions(-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/5] libnvdimm: nd_region flush callback support
  2019-01-09 13:50 [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
@ 2019-01-09 13:50 ` Pankaj Gupta
  2019-01-09 13:50 ` [PATCH v3 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-09 13:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-fsdevel,
	virtualization, linux-acpi, linux-ext4, linux-xfs
  Cc: jack, stefanha, dan.j.williams, riel, nilal, kwolf, pbonzini,
	zwisler, vishal.l.verma, dave.jiang, david, jmoyer,
	xiaoguangrong.eric, hch, mst, jasowang, lcapitulino, imammedo,
	eblake, willy, tytso, adilger.kernel, darrick.wong, rjw, pagupta

This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.

This also handles asynchronous flush requests from the block layer
by creating a child bio and chaining it with parent bio.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/acpi/nfit/core.c     |  4 ++--
 drivers/nvdimm/claim.c       |  6 ++++--
 drivers/nvdimm/nd.h          |  1 +
 drivers/nvdimm/pmem.c        | 12 ++++++++----
 drivers/nvdimm/region_devs.c | 38 ++++++++++++++++++++++++++++++++++++--
 include/linux/libnvdimm.h    |  5 ++++-
 6 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index b072cfc..f154852 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2234,7 +2234,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 		offset = to_interleave_offset(offset, mmio);
 
 	writeq(cmd, mmio->addr.base + offset);
-	nvdimm_flush(nfit_blk->nd_region);
+	nvdimm_flush(nfit_blk->nd_region, NULL, false);
 
 	if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
 		readq(mmio->addr.base + offset);
@@ -2283,7 +2283,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 	}
 
 	if (rw)
-		nvdimm_flush(nfit_blk->nd_region);
+		nvdimm_flush(nfit_blk->nd_region, NULL, false);
 
 	rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
 	return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf..a1dfa06 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
 	sector_t sector = offset >> 9;
-	int rc = 0;
+	int rc = 0, ret = 0;
 
 	if (unlikely(!size))
 		return 0;
@@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	}
 
 	memcpy_flushcache(nsio->addr + offset, buf, size);
-	nvdimm_flush(to_nd_region(ndns->dev.parent));
+	ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL, false);
+	if (ret)
+		rc = ret;
 
 	return rc;
 }
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 98317e7..d53a2d1 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -160,6 +160,7 @@ struct nd_region {
 	struct nd_interleave_set *nd_set;
 	struct nd_percpu_lane __percpu *lane;
 	struct nd_mapping mapping[0];
+	int (*flush)(struct nd_region *nd_region);
 };
 
 struct nd_blk_region {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 6071e29..5d6a4a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 
 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 {
+	int ret = 0;
 	blk_status_t rc = 0;
 	bool do_acct;
 	unsigned long start;
@@ -201,7 +202,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct nd_region *nd_region = to_region(pmem);
 
 	if (bio->bi_opf & REQ_PREFLUSH)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio, true);
 
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 		nd_iostat_end(bio, start);
 
 	if (bio->bi_opf & REQ_FUA)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio, true);
+
+	if (ret)
+		bio->bi_status = errno_to_blk_status(ret);
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
@@ -528,14 +532,14 @@ static int nd_pmem_remove(struct device *dev)
 		sysfs_put(pmem->bb_state);
 		pmem->bb_state = NULL;
 	}
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL, false);
 
 	return 0;
 }
 
 static void nd_pmem_shutdown(struct device *dev)
 {
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL, false);
 }
 
 static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fa37afc..5508727 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -290,7 +290,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
 		return rc;
 	if (!flush)
 		return -EINVAL;
-	nvdimm_flush(nd_region);
+	rc = nvdimm_flush(nd_region, NULL, false);
+	if (rc)
+		return rc;
 
 	return len;
 }
@@ -1065,6 +1067,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	dev->of_node = ndr_desc->of_node;
 	nd_region->ndr_size = resource_size(ndr_desc->res);
 	nd_region->ndr_start = ndr_desc->res->start;
+	if (ndr_desc->flush)
+		nd_region->flush = ndr_desc->flush;
+	else
+		nd_region->flush = generic_nvdimm_flush;
+
 	nd_device_register(dev);
 
 	return nd_region;
@@ -1105,11 +1112,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 }
 EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
 
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
+{
+	int rc = 0;
+
+	/* Create child bio for asynchronous flush and chain with
+	 * parent bio. Otherwise directly call nd_region flush.
+	 */
+	if (async && bio->bi_iter.bi_sector != -1) {
+
+		struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+		if (!child)
+			return -ENOMEM;
+		bio_copy_dev(child, bio);
+		child->bi_opf = REQ_PREFLUSH;
+		child->bi_iter.bi_sector = -1;
+		bio_chain(child, bio);
+		submit_bio(child);
+	} else {
+		if (nd_region->flush(nd_region))
+			rc = -EIO;
+	}
+
+	return rc;
+}
 /**
  * nvdimm_flush - flush any posted write queues between the cpu and pmem media
  * @nd_region: blk or interleaved pmem region
  */
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
 {
 	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i, idx;
@@ -1133,6 +1165,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 		if (ndrd_get_flush_wpq(ndrd, i, 0))
 			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
 	wmb();
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(nvdimm_flush);
 
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 097072c..b49632c 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -115,6 +115,7 @@ struct nd_mapping_desc {
 	int position;
 };
 
+struct nd_region;
 struct nd_region_desc {
 	struct resource *res;
 	struct nd_mapping_desc *mapping;
@@ -126,6 +127,7 @@ struct nd_region_desc {
 	int numa_node;
 	unsigned long flags;
 	struct device_node *of_node;
+	int (*flush)(struct nd_region *nd_region);
 };
 
 struct device;
@@ -201,7 +203,8 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
 unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
 void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
+int generic_nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
 int nvdimm_has_cache(struct nd_region *nd_region);
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/5]  virtio-pmem: Add virtio pmem driver
  2019-01-09 13:50 [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
  2019-01-09 13:50 ` [PATCH v3 1/5] libnvdimm: nd_region flush callback support Pankaj Gupta
@ 2019-01-09 13:50 ` Pankaj Gupta
  2019-01-09 13:50 ` [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag Pankaj Gupta
  2019-01-09 14:46 ` [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
  3 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-09 13:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-fsdevel,
	virtualization, linux-acpi, linux-ext4, linux-xfs
  Cc: jack, stefanha, dan.j.williams, riel, nilal, kwolf, pbonzini,
	zwisler, vishal.l.verma, dave.jiang, david, jmoyer,
	xiaoguangrong.eric, hch, mst, jasowang, lcapitulino, imammedo,
	eblake, willy, tytso, adilger.kernel, darrick.wong, rjw, pagupta

This patch adds virtio-pmem driver for KVM guest.

Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.

This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/nvdimm/virtio_pmem.c     |  84 ++++++++++++++++++++++++++
 drivers/virtio/Kconfig           |  10 ++++
 drivers/virtio/Makefile          |   1 +
 drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_pmem.h      |  60 +++++++++++++++++++
 include/uapi/linux/virtio_ids.h  |   1 +
 include/uapi/linux/virtio_pmem.h |  10 ++++
 7 files changed, 290 insertions(+)
 create mode 100644 drivers/nvdimm/virtio_pmem.c
 create mode 100644 drivers/virtio/pmem.c
 create mode 100644 include/linux/virtio_pmem.h
 create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 0000000..2a1b1ba
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+	unsigned int len;
+	unsigned long flags;
+	struct virtio_pmem_request *req, *req_buf;
+	struct virtio_pmem *vpmem = vq->vdev->priv;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+		req->done = true;
+		wake_up(&req->host_acked);
+
+		if (!list_empty(&vpmem->req_list)) {
+			req_buf = list_first_entry(&vpmem->req_list,
+					struct virtio_pmem_request, list);
+			list_del(&vpmem->req_list);
+			req_buf->wq_buf_avail = true;
+			wake_up(&req_buf->wq_buf);
+		}
+	}
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+	int err;
+	unsigned long flags;
+	struct scatterlist *sgs[2], sg, ret;
+	struct virtio_device *vdev = nd_region->provider_data;
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct virtio_pmem_request *req;
+
+	might_sleep();
+	req = kmalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->done = req->wq_buf_avail = false;
+	strcpy(req->name, "FLUSH");
+	init_waitqueue_head(&req->host_acked);
+	init_waitqueue_head(&req->wq_buf);
+	sg_init_one(&sg, req->name, strlen(req->name));
+	sgs[0] = &sg;
+	sg_init_one(&ret, &req->ret, sizeof(req->ret));
+	sgs[1] = &ret;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+	if (err) {
+		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+		list_add_tail(&vpmem->req_list, &req->list);
+		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+		/* When host has read buffer, this completes via host_ack */
+		wait_event(req->wq_buf, req->wq_buf_avail);
+		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	}
+	virtqueue_kick(vpmem->req_vq);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	/* When host has read buffer, this completes via host_ack */
+	wait_event(req->host_acked, req->done);
+	err = req->ret;
+	kfree(req);
+
+	return err;
+};
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3589764..9f634a2 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
 
 	  If unsure, say Y.
 
+config VIRTIO_PMEM
+	tristate "Support for virtio pmem driver"
+	depends on VIRTIO
+	depends on LIBNVDIMM
+	help
+	This driver provides support for virtio based flushing interface
+	for persistent memory range.
+
+	If unsure, say M.
+
 config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5..143ce91 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
new file mode 100644
index 0000000..51f5349
--- /dev/null
+++ b/drivers/virtio/pmem.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */
+#include <linux/virtio_pmem.h>
+#include <../../drivers/nvdimm/nd.h>
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+	struct virtqueue *vq;
+
+	/* single vq */
+	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+				host_ack, "flush_queue");
+	if (IS_ERR(vq))
+		return PTR_ERR(vq);
+
+	spin_lock_init(&vpmem->pmem_lock);
+	INIT_LIST_HEAD(&vpmem->req_list);
+
+	return 0;
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+	int err = 0;
+	struct resource res;
+	struct virtio_pmem *vpmem;
+	struct nvdimm_bus *nvdimm_bus;
+	struct nd_region_desc ndr_desc;
+	int nid = dev_to_node(&vdev->dev);
+	struct nd_region *nd_region;
+
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+			GFP_KERNEL);
+	if (!vpmem) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	vpmem->vdev = vdev;
+	err = init_vq(vpmem);
+	if (err)
+		goto out_err;
+
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			start, &vpmem->start);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			size, &vpmem->size);
+
+	res.start = vpmem->start;
+	res.end   = vpmem->start + vpmem->size-1;
+	vpmem->nd_desc.provider_name = "virtio-pmem";
+	vpmem->nd_desc.module = THIS_MODULE;
+
+	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+						&vpmem->nd_desc);
+	if (!nvdimm_bus)
+		goto out_vq;
+
+	dev_set_drvdata(&vdev->dev, nvdimm_bus);
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+
+	ndr_desc.res = &res;
+	ndr_desc.numa_node = nid;
+	ndr_desc.flush = virtio_pmem_flush;
+	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+	nd_region->provider_data =  dev_to_virtio
+					(nd_region->dev.parent->parent);
+
+	if (!nd_region)
+		goto out_nd;
+
+	//virtio_device_ready(vdev);
+	return 0;
+out_nd:
+	err = -ENXIO;
+	nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+	vdev->config->del_vqs(vdev);
+out_err:
+	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+	return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	vdev->config->del_vqs(vdev);
+	kfree(vpmem);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtio_pmem_probe,
+	.remove			= virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
new file mode 100644
index 0000000..224f9d9
--- /dev/null
+++ b/include/linux/virtio_pmem.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * virtio_pmem.h: virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ **/
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/virtio_ids.h>
+#include <linux/module.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+#include <linux/spinlock.h>
+
+struct virtio_pmem_request {
+	/* Host return status corresponding to flush request */
+	int ret;
+
+	/* command name*/
+	char name[16];
+
+	/* Wait queue to process deferred work after ack from host */
+	wait_queue_head_t host_acked;
+	bool done;
+
+	/* Wait queue to process deferred work after virt queue buffer avail */
+	wait_queue_head_t wq_buf;
+	bool wq_buf_avail;
+	struct list_head list;
+};
+
+struct virtio_pmem {
+	struct virtio_device *vdev;
+
+	/* Virtio pmem request queue */
+	struct virtqueue *req_vq;
+
+	/* nvdimm bus registers virtio pmem device */
+	struct nvdimm_bus *nvdimm_bus;
+	struct nvdimm_bus_descriptor nd_desc;
+
+	/* List to store deferred work if virtqueue is full */
+	struct list_head req_list;
+
+	/* Synchronize virtqueue data */
+	spinlock_t pmem_lock;
+
+	/* Memory region information */
+	uint64_t start;
+	uint64_t size;
+};
+
+void host_ack(struct virtqueue *vq);
+int virtio_pmem_flush(struct nd_region *nd_region);
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2..3463895 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         25 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 0000000..fa3f7d5
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+	__le64 start;
+	__le64 size;
+};
+#endif
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag
  2019-01-09 13:50 [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
  2019-01-09 13:50 ` [PATCH v3 1/5] libnvdimm: nd_region flush callback support Pankaj Gupta
  2019-01-09 13:50 ` [PATCH v3 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
@ 2019-01-09 13:50 ` Pankaj Gupta
  2019-01-09 17:02   ` Dan Williams
  2019-01-09 14:46 ` [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
  3 siblings, 1 reply; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-09 13:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-fsdevel,
	virtualization, linux-acpi, linux-ext4, linux-xfs
  Cc: jack, stefanha, dan.j.williams, riel, nilal, kwolf, pbonzini,
	zwisler, vishal.l.verma, dave.jiang, david, jmoyer,
	xiaoguangrong.eric, hch, mst, jasowang, lcapitulino, imammedo,
	eblake, willy, tytso, adilger.kernel, darrick.wong, rjw, pagupta

This patch adds 'DAXDEV_BUFFERED' flag which is set
for virtio pmem corresponding nd_region. This later
is used to disable MAP_SYNC functionality for ext4
& xfs filesystem.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/dax/super.c          | 17 +++++++++++++++++
 drivers/nvdimm/pmem.c        |  3 +++
 drivers/nvdimm/region_devs.c |  7 +++++++
 drivers/virtio/pmem.c        |  1 +
 include/linux/dax.h          |  9 +++++++++
 include/linux/libnvdimm.h    |  6 ++++++
 6 files changed, 43 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6e928f3..9128740 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,8 @@ enum dax_device_flags {
 	DAXDEV_ALIVE,
 	/* gate whether dax_flush() calls the low level flush routine */
 	DAXDEV_WRITE_CACHE,
+	/* flag to disable MAP_SYNC for virtio based host page cache flush */
+	DAXDEV_BUFFERED,
 };
 
 /**
@@ -335,6 +337,21 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
 
+void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc)
+{
+	if (wc)
+		set_bit(DAXDEV_BUFFERED, &dax_dev->flags);
+	else
+		clear_bit(DAXDEV_BUFFERED, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_host_cache);
+
+bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev)
+{
+	return test_bit(DAXDEV_BUFFERED, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_host_cache_enabled);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fe1217b..8d190a3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -472,6 +472,9 @@ static int pmem_attach_disk(struct device *dev,
 		return -ENOMEM;
 	}
 	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+
+	/* Set buffered bit in 'dax_dev' for virtio pmem */
+	virtio_pmem_host_cache(dax_dev, nvdimm_is_buffered(nd_region));
 	pmem->dax_dev = dax_dev;
 
 	gendev = disk_to_dev(disk);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index f8218b4..1f8b2be 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1264,6 +1264,13 @@ int nd_region_conflict(struct nd_region *nd_region, resource_size_t start,
 	return device_for_each_child(&nvdimm_bus->dev, &ctx, region_conflict);
 }
 
+int nvdimm_is_buffered(struct nd_region *nd_region)
+{
+	return is_nd_pmem(&nd_region->dev) &&
+		test_bit(ND_REGION_BUFFERED, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(nvdimm_is_buffered);
+
 void __exit nd_region_devs_exit(void)
 {
 	ida_destroy(&region_ida);
diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
index 51f5349..901767b 100644
--- a/drivers/virtio/pmem.c
+++ b/drivers/virtio/pmem.c
@@ -81,6 +81,7 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 	ndr_desc.numa_node = nid;
 	ndr_desc.flush = virtio_pmem_flush;
 	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+	set_bit(ND_REGION_BUFFERED, &ndr_desc.flags);
 	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
 	nd_region->provider_data =  dev_to_virtio
 					(nd_region->dev.parent->parent);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a..d16e03e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -37,6 +37,8 @@ void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
+void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc);
+bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev);
 #else
 static inline struct dax_device *dax_get_by_host(const char *host)
 {
@@ -64,6 +66,13 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
 {
 	return false;
 }
+static inline void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc)
+{
+}
+static inline bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev)
+{
+	return false;
+}
 #endif
 
 struct writeback_control;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index ca8bc07..94616f1 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -64,6 +64,11 @@ enum {
 	 */
 	ND_REGION_PERSIST_MEMCTRL = 2,
 
+	/* provides virtio based asynchronous flush mechanism for buffered
+	 * host page cache.
+	 */
+	ND_REGION_BUFFERED = 3,
+
 	/* mark newly adjusted resources as requiring a label update */
 	DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
@@ -265,6 +270,7 @@ int generic_nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
 int nvdimm_has_cache(struct nd_region *nd_region);
 int nvdimm_in_overwrite(struct nvdimm *nvdimm);
+int nvdimm_is_buffered(struct nd_region *nd_region);
 
 static inline int nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
 		unsigned int buf_len, int *cmd_rc)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
  2019-01-09 13:50 [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
                   ` (2 preceding siblings ...)
  2019-01-09 13:50 ` [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag Pankaj Gupta
@ 2019-01-09 14:46 ` Pankaj Gupta
  3 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-09 14:46 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-fsdevel,
	virtualization, linux-acpi, linux-ext4, linux-xfs
  Cc: jack, david, jasowang, lcapitulino, adilger kernel, zwisler,
	dave jiang, darrick wong, vishal l verma, mst, willy, hch,
	jmoyer, nilal, riel, stefanha, imammedo, dan j williams, kwolf,
	tytso, xiaoguangrong eric, rjw, pbonzini



Please ignore this series as my network went down while 
sending this. I will send this series again.

Thanks,
Pankaj  

> 
> This patch series has implementation for "virtio pmem".
>  "virtio pmem" is fake persistent memory(nvdimm) in guest
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.
>  
>  Sharing guest kernel driver in this patchset with the
>  changes suggested in v2. Tested with Qemu side device
>  emulation for virtio-pmem [6].
>  
>  Details of project idea for 'virtio pmem' flushing interface
>  is shared [3] & [4].
> 
>  Implementation is divided into two parts:
>  New virtio pmem guest driver and qemu code changes for new
>  virtio pmem paravirtualized device.
> 
> 1. Guest virtio-pmem kernel driver
> ---------------------------------
>    - Reads persistent memory range from paravirt device and
>      registers with 'nvdimm_bus'.
>    - 'nvdimm/pmem' driver uses this information to allocate
>      persistent memory region and setup filesystem operations
>      to the allocated memory.
>    - virtio pmem driver implements asynchronous flushing
>      interface to flush from guest to host.
> 
> 2. Qemu virtio-pmem device
> ---------------------------------
>    - Creates virtio pmem device and exposes a memory range to
>      KVM guest.
>    - At host side this is file backed memory which acts as
>      persistent memory.
>    - Qemu side flush uses aio thread pool API's and virtio
>      for asynchronous guest multi request handling.
> 
>    David Hildenbrand CCed also posted a modified version[6] of
>    qemu virtio-pmem code based on updated Qemu memory device API.
> 
>  Virtio-pmem errors handling:
>  ----------------------------------------
>   Checked behaviour of virtio-pmem for below types of errors
>   Need suggestions on expected behaviour for handling these errors?
> 
>   - Hardware Errors: Uncorrectable recoverable Errors:
>   a] virtio-pmem:
>     - As per current logic if error page belongs to Qemu process,
>       host MCE handler isolates(hwpoison) that page and send SIGBUS.
>       Qemu SIGBUS handler injects exception to KVM guest.
>     - KVM guest then isolates the page and send SIGBUS to guest
>       userspace process which has mapped the page.
>   
>   b] Existing implementation for ACPI pmem driver:
>     - Handles such errors with MCE notifier and creates a list
>       of bad blocks. Read/direct access DAX operation return EIO
>       if accessed memory page fall in bad block list.
>     - It also starts backgound scrubbing.
>     - Similar functionality can be reused in virtio-pmem with MCE
>       notifier but without scrubbing(no ACPI/ARS)? Need inputs to
>       confirm if this behaviour is ok or needs any change?
> 
> Changes from PATCH v2: [1]
> - Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
> - Use name 'virtio pmem' in place of 'fake dax'
> 
> Changes from PATCH v1: [2]
> - 0-day build test for build dependency on libnvdimm
> 
>  Changes suggested by - [Dan Williams]
> - Split the driver into two parts virtio & pmem
> - Move queuing of async block request to block layer
> - Add "sync" parameter in nvdimm_flush function
> - Use indirect call for nvdimm_flush
> - Don’t move declarations to common global header e.g nd.h
> - nvdimm_flush() return 0 or -EIO if it fails
> - Teach nsio_rw_bytes() that the flush can fail
> - Rename nvdimm_flush() to generic_nvdimm_flush()
> - Use 'nd_region->provider_data' for long dereferencing
> - Remove virtio_pmem_freeze/restore functions
> - Remove BSD license text with SPDX license text
> 
> - Add might_sleep() in virtio_pmem_flush - [Luiz]
> - Make spin_lock_irqsave() narrow
> 
> Changes from RFC v3
> - Rebase to latest upstream - Luiz
> - Call ndregion->flush in place of nvdimm_flush- Luiz
> - kmalloc return check - Luiz
> - virtqueue full handling - Stefan
> - Don't map entire virtio_pmem_req to device - Stefan
> - request leak, correct sizeof req- Stefan
> - Move declaration to virtio_pmem.c
> 
> Changes from RFC v2:
> - Add flush function in the nd_region in place of switching
>   on a flag - Dan & Stefan
> - Add flush completion function with proper locking and wait
>   for host side flush completion - Stefan & Dan
> - Keep userspace API in uapi header file - Stefan, MST
> - Use LE fields & New device id - MST
> - Indentation & spacing suggestions - MST & Eric
> - Remove extra header files & add licensing - Stefan
> 
> Changes from RFC v1:
> - Reuse existing 'pmem' code for registering persistent
>   memory and other operations instead of creating an entirely
>   new block driver.
> - Use VIRTIO driver to register memory information with
>   nvdimm_bus and create region_type accordingly.
> - Call VIRTIO flush from existing pmem driver.
> 
> Pankaj Gupta (5):
>    libnvdimm: nd_region flush callback support
>    virtio-pmem: Add virtio-pmem guest driver
>    libnvdimm: add nd_region buffered dax_dev flag
>    ext4: disable map_sync for virtio pmem
>    xfs: disable map_sync for virtio pmem
> 
> [2] https://lkml.org/lkml/2018/8/31/407
> [3] https://www.spinics.net/lists/kvm/msg149761.html
> [4] https://www.spinics.net/lists/kvm/msg153095.html
> [5] https://lkml.org/lkml/2018/8/31/413
> [6] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
> 
>  drivers/acpi/nfit/core.c         |    4 -
>  drivers/dax/super.c              |   17 +++++
>  drivers/nvdimm/claim.c           |    6 +
>  drivers/nvdimm/nd.h              |    1
>  drivers/nvdimm/pmem.c            |   15 +++-
>  drivers/nvdimm/region_devs.c     |   45 +++++++++++++-
>  drivers/nvdimm/virtio_pmem.c     |   84 ++++++++++++++++++++++++++
>  drivers/virtio/Kconfig           |   10 +++
>  drivers/virtio/Makefile          |    1
>  drivers/virtio/pmem.c            |  125
>  +++++++++++++++++++++++++++++++++++++++
>  fs/ext4/file.c                   |   11 +++
>  fs/xfs/xfs_file.c                |    8 ++
>  include/linux/dax.h              |    9 ++
>  include/linux/libnvdimm.h        |   11 +++
>  include/linux/virtio_pmem.h      |   60 ++++++++++++++++++
>  include/uapi/linux/virtio_ids.h  |    1
>  include/uapi/linux/virtio_pmem.h |   10 +++
>  17 files changed, 406 insertions(+), 12 deletions(-)
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag
  2019-01-09 13:50 ` [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag Pankaj Gupta
@ 2019-01-09 17:02   ` Dan Williams
  2019-01-09 18:21     ` Pankaj Gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Williams @ 2019-01-09 17:02 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Linux Kernel Mailing List, KVM list, Qemu Developers,
	linux-nvdimm, linux-fsdevel, virtualization, Linux ACPI,
	linux-ext4, linux-xfs, Jan Kara, Stefan Hajnoczi, Rik van Riel,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Ross Zwisler,
	Vishal L Verma, Dave Jiang, David Hildenbrand, jmoyer,
	Xiao Guangrong, Christoph Hellwig, Michael S. Tsirkin,
	Jason Wang, lcapitulino, Igor Mammedov, Eric Blake,
	Matthew Wilcox, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Rafael J. Wysocki

On Wed, Jan 9, 2019 at 5:53 AM Pankaj Gupta <pagupta@redhat.com> wrote:
>
> This patch adds 'DAXDEV_BUFFERED' flag which is set
> for virtio pmem corresponding nd_region. This later
> is used to disable MAP_SYNC functionality for ext4
> & xfs filesystem.
>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/dax/super.c          | 17 +++++++++++++++++
>  drivers/nvdimm/pmem.c        |  3 +++
>  drivers/nvdimm/region_devs.c |  7 +++++++
>  drivers/virtio/pmem.c        |  1 +
>  include/linux/dax.h          |  9 +++++++++
>  include/linux/libnvdimm.h    |  6 ++++++
>  6 files changed, 43 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 6e928f3..9128740 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -167,6 +167,8 @@ enum dax_device_flags {
>         DAXDEV_ALIVE,
>         /* gate whether dax_flush() calls the low level flush routine */
>         DAXDEV_WRITE_CACHE,
> +       /* flag to disable MAP_SYNC for virtio based host page cache flush */
> +       DAXDEV_BUFFERED,
>  };
>
>  /**
> @@ -335,6 +337,21 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
>
> +void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc)
> +{
> +       if (wc)
> +               set_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> +       else
> +               clear_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> +}
> +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache);

The "write_cache" property was structured this way because it can
conceivably change at runtime. The MAP_SYNC capability should be
static and never changed after init.

> +bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev)
> +{
> +       return test_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> +}
> +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache_enabled);

Echoing Darrick and Jan this is should be a generic property of a
dax_device and not specific to virtio. I don't like the "buffered"
designation as that's not accurate. There may be hardware reasons why
a dax_device is not synchronous, like a requirement to flush a
write-pending queue or otherwise notify the device of new writes.

I would just have a dax_synchronous() helper and a DAXDEV_SYNC flag. I
would also modify alloc_dax() to take a flags argument so that the
capability can be instantiated when the dax_device is allocated.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag
  2019-01-09 17:02   ` Dan Williams
@ 2019-01-09 18:21     ` Pankaj Gupta
  0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-09 18:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, KVM list, Qemu Developers,
	linux-nvdimm, linux-fsdevel, virtualization, Linux ACPI,
	linux-ext4, linux-xfs, Jan Kara, Stefan Hajnoczi, Rik van Riel,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Ross Zwisler,
	Vishal L Verma, Dave Jiang, David Hildenbrand, jmoyer,
	Xiao Guangrong, Christoph Hellwig, Michael S. Tsirkin,
	Jason Wang, lcapitulino, Igor Mammedov, Eric Blake,
	Matthew Wilcox, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Rafael J. Wysocki


> >
> > This patch adds 'DAXDEV_BUFFERED' flag which is set
> > for virtio pmem corresponding nd_region. This later
> > is used to disable MAP_SYNC functionality for ext4
> > & xfs filesystem.
> >
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/dax/super.c          | 17 +++++++++++++++++
> >  drivers/nvdimm/pmem.c        |  3 +++
> >  drivers/nvdimm/region_devs.c |  7 +++++++
> >  drivers/virtio/pmem.c        |  1 +
> >  include/linux/dax.h          |  9 +++++++++
> >  include/linux/libnvdimm.h    |  6 ++++++
> >  6 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 6e928f3..9128740 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -167,6 +167,8 @@ enum dax_device_flags {
> >         DAXDEV_ALIVE,
> >         /* gate whether dax_flush() calls the low level flush routine */
> >         DAXDEV_WRITE_CACHE,
> > +       /* flag to disable MAP_SYNC for virtio based host page cache flush
> > */
> > +       DAXDEV_BUFFERED,
> >  };
> >
> >  /**
> > @@ -335,6 +337,21 @@ bool dax_write_cache_enabled(struct dax_device
> > *dax_dev)
> >  }
> >  EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
> >
> > +void virtio_pmem_host_cache(struct dax_device *dax_dev, bool wc)
> > +{
> > +       if (wc)
> > +               set_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> > +       else
> > +               clear_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache);
> 
> The "write_cache" property was structured this way because it can
> conceivably change at runtime. The MAP_SYNC capability should be
> static and never changed after init.

o.k. Will change.

> 
> > +bool virtio_pmem_host_cache_enabled(struct dax_device *dax_dev)
> > +{
> > +       return test_bit(DAXDEV_BUFFERED, &dax_dev->flags);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_pmem_host_cache_enabled);
> 
> Echoing Darrick and Jan this is should be a generic property of a
> dax_device and not specific to virtio. I don't like the "buffered"
> designation as that's not accurate. There may be hardware reasons why
> a dax_device is not synchronous, like a requirement to flush a
> write-pending queue or otherwise notify the device of new writes.

Agree.

> 
> I would just have a dax_synchronous() helper and a DAXDEV_SYNC flag. I
> would also modify alloc_dax() to take a flags argument so that the
> capability can be instantiated when the dax_device is allocated.

o.k. Will make the change.

Thanks,
Pankaj
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
       [not found]               ` <20190114205031-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2019-01-15  5:37                 ` Pankaj Gupta
  2019-01-15  5:37                   ` Pankaj Gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-15  5:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jan Kara, KVM list, David Hildenbrand, linux-nvdimm, Jason Wang,
	Dave Chinner, Qemu Developers,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	adilger kernel, Ross Zwisler, darrick wong, Matthew Wilcox,
	Christoph Hellwig, Linux ACPI, linux-ext4, Rik van Riel,
	Stefan Hajnoczi, Paolo Bonzini,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, xiaoguangrong eric,
	Rafael J. Wysocki


> > > >
> > > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > > > >
> > > > > > > Until you have images (and hence host page cache) shared between
> > > > > > > multiple guests. People will want to do this, because it means
> > > > > > > they
> > > > > > > only need a single set of pages in host memory for executable
> > > > > > > binaries rather than a set of pages per guest. Then you have
> > > > > > > multiple guests being able to detect residency of the same set of
> > > > > > > pages. If the guests can then, in any way, control eviction of
> > > > > > > the
> > > > > > > pages from the host cache, then we have a guest-to-guest
> > > > > > > information
> > > > > > > leak channel.
> > > > > >
> > > > > > I don't think we should ever be considering something that would
> > > > > > allow a
> > > > > > guest to evict page's from the host's pagecache [1].  The guest
> > > > > > should
> > > > > > be able to kick its own references to the host's pagecache out of
> > > > > > its
> > > > > > own pagecache, but not be able to influence whether the host or
> > > > > > another
> > > > > > guest has a read-only mapping cached.
> > > > > >
> > > > > > [1] Unless the guest is allowed to modify the host's file;
> > > > > > obviously
> > > > > > truncation, holepunching, etc are going to evict pages from the
> > > > > > host's
> > > > > > page cache.
> > > > >
> > > > > This is so correct. Guest does not not evict host page cache pages
> > > > > directly.
> > > >
> > > > They don't right now.
> > > >
> > > > But someone is going to end up asking for discard to work so that
> > > > the guest can free unused space in the underlying spares image (i.e.
> > > > make use of fstrim or mount -o discard) because they have workloads
> > > > that have bursts of space usage and they need to trim the image
> > > > files afterwards to keep their overall space usage under control.
> > > >
> > > > And then....
> > > 
> > > ...we reject / push back on that patch citing the above concern.
> > 
> > So at what point do we draw the line?
> > 
> > We're allowing writable DAX mappings, but as I've pointed out that
> > means we are going to be allowing  a potential information leak via
> > files with shared extents to be directly mapped and written to.
> > 
> > But we won't allow useful admin operations that allow better
> > management of host side storage space similar to how normal image
> > files are used by guests because it's an information leak vector?
> > 
> > That's splitting some really fine hairs there...
> 
> May I summarize that th security implications need to
> be documented?
> 
> In fact that would make a fine security implications section
> in the device specification.

This is a very good suggestion. 

I will document the security implications in details in device specification
with details of what all filesystem features we don't support and why.

Best regards,
Pankaj

> 
> 
> 
> 
> 
> > > > > In case of virtio-pmem & DAX, guest clears guest page cache
> > > > > exceptional entries.
> > > > > Its solely decision of host to take action on the host page cache
> > > > > pages.
> > > > >
> > > > > In case of virtio-pmem, guest does not modify host file directly i.e
> > > > > don't
> > > > > perform hole punch & truncation operation directly on host file.
> > > >
> > > > ... this will no longer be true, and the nuclear landmine in this
> > > > driver interface will have been armed....
> > > 
> > > I agree with the need to be careful when / if explicit cache control
> > > is added, but that's not the case today.
> > 
> > "if"?
> > 
> > I expect it to be "when", not if. Expect the worst, plan for it now.
> > 
> > Cheers,
> > 
> > Dave.
> > --
> > Dave Chinner
> > david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
  2019-01-15  5:37                 ` [Qemu-devel] " Pankaj Gupta
@ 2019-01-15  5:37                   ` Pankaj Gupta
  0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-15  5:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dave Chinner, Jan Kara, KVM list, David Hildenbrand,
	linux-nvdimm, Jason Wang, Qemu Developers, virtualization,
	adilger kernel, Ross Zwisler, dave jiang, darrick wong,
	vishal l verma, Matthew Wilcox, Christoph Hellwig, Linux ACPI,
	jmoyer, linux-ext4, Rik van Riel, Stefan Hajnoczi, Igor Mammedov,
	Dan Williams, lcapitulino, Kevin Wolf, Nitesh Narayan Lal,
	Theodore Ts'o, xiaoguangrong eric, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel,
	Paolo Bonzini


> > > >
> > > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > > > >
> > > > > > > Until you have images (and hence host page cache) shared between
> > > > > > > multiple guests. People will want to do this, because it means
> > > > > > > they
> > > > > > > only need a single set of pages in host memory for executable
> > > > > > > binaries rather than a set of pages per guest. Then you have
> > > > > > > multiple guests being able to detect residency of the same set of
> > > > > > > pages. If the guests can then, in any way, control eviction of
> > > > > > > the
> > > > > > > pages from the host cache, then we have a guest-to-guest
> > > > > > > information
> > > > > > > leak channel.
> > > > > >
> > > > > > I don't think we should ever be considering something that would
> > > > > > allow a
> > > > > > guest to evict page's from the host's pagecache [1].  The guest
> > > > > > should
> > > > > > be able to kick its own references to the host's pagecache out of
> > > > > > its
> > > > > > own pagecache, but not be able to influence whether the host or
> > > > > > another
> > > > > > guest has a read-only mapping cached.
> > > > > >
> > > > > > [1] Unless the guest is allowed to modify the host's file;
> > > > > > obviously
> > > > > > truncation, holepunching, etc are going to evict pages from the
> > > > > > host's
> > > > > > page cache.
> > > > >
> > > > > This is so correct. Guest does not not evict host page cache pages
> > > > > directly.
> > > >
> > > > They don't right now.
> > > >
> > > > But someone is going to end up asking for discard to work so that
> > > > the guest can free unused space in the underlying spares image (i.e.
> > > > make use of fstrim or mount -o discard) because they have workloads
> > > > that have bursts of space usage and they need to trim the image
> > > > files afterwards to keep their overall space usage under control.
> > > >
> > > > And then....
> > > 
> > > ...we reject / push back on that patch citing the above concern.
> > 
> > So at what point do we draw the line?
> > 
> > We're allowing writable DAX mappings, but as I've pointed out that
> > means we are going to be allowing  a potential information leak via
> > files with shared extents to be directly mapped and written to.
> > 
> > But we won't allow useful admin operations that allow better
> > management of host side storage space similar to how normal image
> > files are used by guests because it's an information leak vector?
> > 
> > That's splitting some really fine hairs there...
> 
> May I summarize that th security implications need to
> be documented?
> 
> In fact that would make a fine security implications section
> in the device specification.

This is a very good suggestion. 

I will document the security implications in details in device specification
with details of what all filesystem features we don't support and why.

Best regards,
Pankaj

> 
> 
> 
> 
> 
> > > > > In case of virtio-pmem & DAX, guest clears guest page cache
> > > > > exceptional entries.
> > > > > Its solely decision of host to take action on the host page cache
> > > > > pages.
> > > > >
> > > > > In case of virtio-pmem, guest does not modify host file directly i.e
> > > > > don't
> > > > > perform hole punch & truncation operation directly on host file.
> > > >
> > > > ... this will no longer be true, and the nuclear landmine in this
> > > > driver interface will have been armed....
> > > 
> > > I agree with the need to be careful when / if explicit cache control
> > > is added, but that's not the case today.
> > 
> > "if"?
> > 
> > I expect it to be "when", not if. Expect the worst, plan for it now.
> > 
> > Cheers,
> > 
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
       [not found]                 ` <20190114095520.GC13316-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2019-01-14 10:16                   ` Pankaj Gupta
  2019-01-14 10:16                     ` Pankaj Gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-14 10:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: KVM list, Michael S. Tsirkin, linux-nvdimm, Jason Wang,
	Dave Chinner, Qemu Developers,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	adilger kernel, Ross Zwisler, darrick wong, David Hildenbrand,
	Matthew Wilcox, Christoph Hellwig, Linux ACPI, linux-ext4,
	Rik van Riel, Stefan Hajnoczi, Paolo Bonzini,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, xiaoguangrong eric,


> > > > > Right. Thinking about this I would be more concerned about the fact
> > > > > that
> > > > > guest can effectively pin amount of host's page cache upto size of
> > > > > the
> > > > > device/file passed to guest as PMEM, can't it Pankaj? Or is there
> > > > > some
> > > > > QEMU
> > > > > magic that avoids this?
> > > >
> > > > Yes, guest will pin these host page cache pages using 'get_user_pages'
> > > > by
> > > > elevating the page reference count. But these pages can be reclaimed by
> > > > host
> > > > at any time when there is memory pressure.
> > > 
> > > Wait, how can the guest pin the host pages? I would expect this to
> > > happen only when using vfio and device assignment. Otherwise, no the
> > > host can't reclaim a pinned page, that's the whole point of a pin to
> > > prevent the mm from reclaiming ownership.
> > 
> > yes. You are right I just used the pin word but it does not actually pin
> > pages
> > permanently. I had gone through the discussion on existing problems with
> > get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP
> > pin pages so I also used the word 'pin'. But guest does not permanently pin
> > these pages and these pages can be reclaimed by host.
> 
> OK, then I was just confused how virtio-pmem is going to work. Thanks for
> explanation! So can I imagine this as guest mmaping the host file and
> providing the mapped range as "NVDIMM pages" to the kernel inside the
> guest? Or is it more complex?

yes, that's correct. Host's Qemu process virtual address range is used as guest physical 
address and a direct mapping(EPT/NPT) is established. At guest side, this physical memory
range is plugged into guest system memory map and DAX mapping is setup using nvdimm calls.

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
  2019-01-14 10:16                   ` Pankaj Gupta
@ 2019-01-14 10:16                     ` Pankaj Gupta
  0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-14 10:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, KVM list, David Hildenbrand, linux-nvdimm,
	Jason Wang, Dave Chinner, Qemu Developers, virtualization,
	adilger kernel, Ross Zwisler, dave jiang, darrick wong,
	vishal l verma, Michael S. Tsirkin, Matthew Wilcox,
	Christoph Hellwig, Linux ACPI, jmoyer, linux-ext4, Rik van Riel,
	Stefan Hajnoczi, Igor Mammedov, lcapitulino, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, xiaoguangrong eric,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-xfs,
	linux-fsdevel, Paolo Bonzini


> > > > > Right. Thinking about this I would be more concerned about the fact
> > > > > that
> > > > > guest can effectively pin amount of host's page cache upto size of
> > > > > the
> > > > > device/file passed to guest as PMEM, can't it Pankaj? Or is there
> > > > > some
> > > > > QEMU
> > > > > magic that avoids this?
> > > >
> > > > Yes, guest will pin these host page cache pages using 'get_user_pages'
> > > > by
> > > > elevating the page reference count. But these pages can be reclaimed by
> > > > host
> > > > at any time when there is memory pressure.
> > > 
> > > Wait, how can the guest pin the host pages? I would expect this to
> > > happen only when using vfio and device assignment. Otherwise, no the
> > > host can't reclaim a pinned page, that's the whole point of a pin to
> > > prevent the mm from reclaiming ownership.
> > 
> > yes. You are right I just used the pin word but it does not actually pin
> > pages
> > permanently. I had gone through the discussion on existing problems with
> > get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP
> > pin pages so I also used the word 'pin'. But guest does not permanently pin
> > these pages and these pages can be reclaimed by host.
> 
> OK, then I was just confused how virtio-pmem is going to work. Thanks for
> explanation! So can I imagine this as guest mmaping the host file and
> providing the mapped range as "NVDIMM pages" to the kernel inside the
> guest? Or is it more complex?

yes, that's correct. Host's Qemu process virtual address range is used as guest physical 
address and a direct mapping(EPT/NPT) is established. At guest side, this physical memory
range is plugged into guest system memory map and DAX mapping is setup using nvdimm calls.

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
       [not found]             ` <540171952.63371441.1547345866585.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2019-01-14  9:55               ` Jan Kara
  2019-01-14  9:55                 ` Jan Kara
       [not found]                 ` <20190114095520.GC13316-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Kara @ 2019-01-14  9:55 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Jan Kara, KVM list, Michael S. Tsirkin, linux-nvdimm, Jason Wang,
	Dave Chinner, Qemu Developers,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	adilger kernel, Ross Zwisler, darrick wong, David Hildenbrand,
	Matthew Wilcox, Christoph Hellwig, Linux ACPI, linux-ext4,
	Rik van Riel, Stefan Hajnoczi, Paolo Bonzini,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, xiaoguangrong eric

On Sat 12-01-19 21:17:46, Pankaj Gupta wrote:
> > > > Right. Thinking about this I would be more concerned about the fact that
> > > > guest can effectively pin amount of host's page cache upto size of the
> > > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some
> > > > QEMU
> > > > magic that avoids this?
> > >
> > > Yes, guest will pin these host page cache pages using 'get_user_pages' by
> > > elevating the page reference count. But these pages can be reclaimed by
> > > host
> > > at any time when there is memory pressure.
> > 
> > Wait, how can the guest pin the host pages? I would expect this to
> > happen only when using vfio and device assignment. Otherwise, no the
> > host can't reclaim a pinned page, that's the whole point of a pin to
> > prevent the mm from reclaiming ownership.
> 
> yes. You are right I just used the pin word but it does not actually pin pages 
> permanently. I had gone through the discussion on existing problems with 
> get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP 
> pin pages so I also used the word 'pin'. But guest does not permanently pin 
> these pages and these pages can be reclaimed by host.

OK, then I was just confused how virtio-pmem is going to work. Thanks for
explanation! So can I imagine this as guest mmaping the host file and
providing the mapped range as "NVDIMM pages" to the kernel inside the
guest? Or is it more complex?

								Honza

-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
  2019-01-14  9:55               ` Jan Kara
@ 2019-01-14  9:55                 ` Jan Kara
       [not found]                 ` <20190114095520.GC13316-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Kara @ 2019-01-14  9:55 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Dan Williams, Jan Kara, KVM list, David Hildenbrand,
	linux-nvdimm, Jason Wang, Dave Chinner, Qemu Developers,
	virtualization, adilger kernel, Ross Zwisler, dave jiang,
	darrick wong, vishal l verma, Michael S. Tsirkin, Matthew Wilcox,
	Christoph Hellwig, Linux ACPI, jmoyer, linux-ext4, Rik van Riel,
	Stefan Hajnoczi, Igor Mammedov, lcapitulino, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, xiaoguangrong eric,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-xfs,
	linux-fsdevel, Paolo Bonzini

On Sat 12-01-19 21:17:46, Pankaj Gupta wrote:
> > > > Right. Thinking about this I would be more concerned about the fact that
> > > > guest can effectively pin amount of host's page cache upto size of the
> > > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some
> > > > QEMU
> > > > magic that avoids this?
> > >
> > > Yes, guest will pin these host page cache pages using 'get_user_pages' by
> > > elevating the page reference count. But these pages can be reclaimed by
> > > host
> > > at any time when there is memory pressure.
> > 
> > Wait, how can the guest pin the host pages? I would expect this to
> > happen only when using vfio and device assignment. Otherwise, no the
> > host can't reclaim a pinned page, that's the whole point of a pin to
> > prevent the mm from reclaiming ownership.
> 
> yes. You are right I just used the pin word but it does not actually pin pages 
> permanently. I had gone through the discussion on existing problems with 
> get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP 
> pin pages so I also used the word 'pin'. But guest does not permanently pin 
> these pages and these pages can be reclaimed by host.

OK, then I was just confused how virtio-pmem is going to work. Thanks for
explanation! So can I imagine this as guest mmaping the host file and
providing the mapped range as "NVDIMM pages" to the kernel inside the
guest? Or is it more complex?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
       [not found]         ` <CAPcyv4hwcgTUpgNCefCGu4DvgkYBp5b=f+hJ+FC=s5APYKoycg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-01-13  2:17           ` Pankaj Gupta
  2019-01-13  2:17             ` Pankaj Gupta
       [not found]             ` <540171952.63371441.1547345866585.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-13  2:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, KVM list, Michael S. Tsirkin, linux-nvdimm, Jason Wang,
	Dave Chinner, Qemu Developers,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	adilger kernel, Ross Zwisler, darrick wong, David Hildenbrand,
	Matthew Wilcox, Christoph Hellwig, Linux ACPI, linux-ext4,
	Rik van Riel, Stefan Hajnoczi, Paolo Bonzini,
	lcapitulino-H+wXaHxf7aLQT0dZR+AlfA, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, xiaoguangrong eric


> >
> >
> >
> > >
> > > On Thu 10-01-19 12:26:17, Dave Chinner wrote:
> > > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
> > > > >  This patch series has implementation for "virtio pmem".
> > > > >  "virtio pmem" is fake persistent memory(nvdimm) in guest
> > > > >  which allows to bypass the guest page cache. This also
> > > > >  implements a VIRTIO based asynchronous flush mechanism.
> > > >
> > > > Hmmmm. Sharing the host page cache direct into the guest VM. Sounds
> > > > like a good idea, but.....
> > > >
> > > > This means the guest VM can now run timing attacks to observe host
> > > > side page cache residency, and depending on the implementation I'm
> > > > guessing that the guest will be able to control host side page
> > > > cache eviction, too (e.g. via discard or hole punch operations).
> > > >
> > > > Which means this functionality looks to me like a new vector for
> > > > information leakage into and out of the guest VM via guest
> > > > controlled host page cache manipulation.
> > > >
> > > > https://arxiv.org/pdf/1901.01161
> > > >
> > > > I might be wrong, but if I'm not we're going to have to be very
> > > > careful about how guest VMs can access and manipulate host side
> > > > resources like the page cache.....
> > >
> > > Right. Thinking about this I would be more concerned about the fact that
> > > guest can effectively pin amount of host's page cache upto size of the
> > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some
> > > QEMU
> > > magic that avoids this?
> >
> > Yes, guest will pin these host page cache pages using 'get_user_pages' by
> > elevating the page reference count. But these pages can be reclaimed by
> > host
> > at any time when there is memory pressure.
> 
> Wait, how can the guest pin the host pages? I would expect this to
> happen only when using vfio and device assignment. Otherwise, no the
> host can't reclaim a pinned page, that's the whole point of a pin to
> prevent the mm from reclaiming ownership.

yes. You are right I just used the pin word but it does not actually pin pages 
permanently. I had gone through the discussion on existing problems with 
get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP 
pin pages so I also used the word 'pin'. But guest does not permanently pin 
these pages and these pages can be reclaimed by host.

> 
> > KVM does not permanently pin pages. vfio does that but we are not using
> > it here.
> 
> Right, so I'm confused by your pin assertion above.

Sorry! for the confusion. 

[1] https://lwn.net/Articles/753027/

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
  2019-01-13  2:17           ` [Qemu-devel] " Pankaj Gupta
@ 2019-01-13  2:17             ` Pankaj Gupta
       [not found]             ` <540171952.63371441.1547345866585.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2019-01-13  2:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, KVM list, David Hildenbrand, linux-nvdimm, Jason Wang,
	Dave Chinner, Qemu Developers, virtualization, adilger kernel,
	Ross Zwisler, dave jiang, darrick wong, vishal l verma,
	Michael S. Tsirkin, Matthew Wilcox, Christoph Hellwig,
	Linux ACPI, jmoyer, linux-ext4, Rik van Riel, Stefan Hajnoczi,
	Igor Mammedov, lcapitulino, Kevin Wolf, Nitesh Narayan Lal,
	Theodore Ts'o, xiaoguangrong eric, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel,
	Paolo Bonzini


> >
> >
> >
> > >
> > > On Thu 10-01-19 12:26:17, Dave Chinner wrote:
> > > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
> > > > >  This patch series has implementation for "virtio pmem".
> > > > >  "virtio pmem" is fake persistent memory(nvdimm) in guest
> > > > >  which allows to bypass the guest page cache. This also
> > > > >  implements a VIRTIO based asynchronous flush mechanism.
> > > >
> > > > Hmmmm. Sharing the host page cache direct into the guest VM. Sounds
> > > > like a good idea, but.....
> > > >
> > > > This means the guest VM can now run timing attacks to observe host
> > > > side page cache residency, and depending on the implementation I'm
> > > > guessing that the guest will be able to control host side page
> > > > cache eviction, too (e.g. via discard or hole punch operations).
> > > >
> > > > Which means this functionality looks to me like a new vector for
> > > > information leakage into and out of the guest VM via guest
> > > > controlled host page cache manipulation.
> > > >
> > > > https://arxiv.org/pdf/1901.01161
> > > >
> > > > I might be wrong, but if I'm not we're going to have to be very
> > > > careful about how guest VMs can access and manipulate host side
> > > > resources like the page cache.....
> > >
> > > Right. Thinking about this I would be more concerned about the fact that
> > > guest can effectively pin amount of host's page cache upto size of the
> > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some
> > > QEMU
> > > magic that avoids this?
> >
> > Yes, guest will pin these host page cache pages using 'get_user_pages' by
> > elevating the page reference count. But these pages can be reclaimed by
> > host
> > at any time when there is memory pressure.
> 
> Wait, how can the guest pin the host pages? I would expect this to
> happen only when using vfio and device assignment. Otherwise, no the
> host can't reclaim a pinned page, that's the whole point of a pin to
> prevent the mm from reclaiming ownership.

yes. You are right I just used the pin word but it does not actually pin pages 
permanently. I had gone through the discussion on existing problems with 
get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP 
pin pages so I also used the word 'pin'. But guest does not permanently pin 
these pages and these pages can be reclaimed by host.

> 
> > KVM does not permanently pin pages. vfio does that but we are not using
> > it here.
> 
> Right, so I'm confused by your pin assertion above.

Sorry! for the confusion. 

[1] https://lwn.net/Articles/753027/

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-01-15  5:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-09 13:50 [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
2019-01-09 13:50 ` [PATCH v3 1/5] libnvdimm: nd_region flush callback support Pankaj Gupta
2019-01-09 13:50 ` [PATCH v3 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
2019-01-09 13:50 ` [PATCH v3 3/5] libnvdimm: add nd_region buffered dax_dev flag Pankaj Gupta
2019-01-09 17:02   ` Dan Williams
2019-01-09 18:21     ` Pankaj Gupta
2019-01-09 14:46 ` [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device Pankaj Gupta
2019-01-09 14:47 Pankaj Gupta
2019-01-10  1:26 ` Dave Chinner
2019-01-10 10:17   ` Jan Kara
2019-01-13  1:38     ` Pankaj Gupta
2019-01-13  1:43       ` Dan Williams
     [not found]         ` <CAPcyv4hwcgTUpgNCefCGu4DvgkYBp5b=f+hJ+FC=s5APYKoycg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2019-01-13  2:17           ` [Qemu-devel] " Pankaj Gupta
2019-01-13  2:17             ` Pankaj Gupta
     [not found]             ` <540171952.63371441.1547345866585.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2019-01-14  9:55               ` Jan Kara
2019-01-14  9:55                 ` Jan Kara
     [not found]                 ` <20190114095520.GC13316-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2019-01-14 10:16                   ` Pankaj Gupta
2019-01-14 10:16                     ` Pankaj Gupta
2019-01-13 23:29 ` Dave Chinner
2019-01-13 23:38   ` Matthew Wilcox
2019-01-14  7:15     ` Pankaj Gupta
2019-01-14 21:25       ` Dave Chinner
2019-01-14 21:35         ` Dan Williams
2019-01-14 22:21           ` Dave Chinner
2019-01-15  2:19             ` Michael S. Tsirkin
     [not found]               ` <20190114205031-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2019-01-15  5:37                 ` [Qemu-devel] " Pankaj Gupta
2019-01-15  5:37                   ` Pankaj Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).