linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/6] virtio pmem driver
@ 2019-04-10  4:08 Pankaj Gupta
  2019-04-10  4:08 ` [PATCH v5 1/6] libnvdimm: nd_region flush callback support Pankaj Gupta
                   ` (5 more replies)
  0 siblings, 6 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

 This patch series has implementation for "virtio pmem". 
 "virtio pmem" is fake persistent memory(nvdimm) in guest 
 which allows to bypass the guest page cache. This also
 implements a VIRTIO based asynchronous flush mechanism.  
 
 Sharing guest kernel driver in this patchset with the 
 changes suggested in v4. Tested with Qemu side device 
 emulation [6] for virtio-pmem. 

 We have incorporated all the suggestions in V4. Documented 
 the impact of possible page cache side channel attacks with 
 suggested countermeasures.
 
 Details of project idea for 'virtio pmem' flushing interface 
 is shared [3] & [4].

 Implementation is divided into two parts:
 New virtio pmem guest driver and qemu code changes for new 
 virtio pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
---------------------------------
   - Reads persistent memory range from paravirt device and 
     registers with 'nvdimm_bus'.  
   - 'nvdimm/pmem' driver uses this information to allocate 
     persistent memory region and setup filesystem operations 
     to the allocated memory. 
   - virtio pmem driver implements asynchronous flushing 
     interface to flush from guest to host.

2. Qemu virtio-pmem device
---------------------------------
   - Creates virtio pmem device and exposes a memory range to 
     KVM guest. 
   - At host side this is file backed memory which acts as 
     persistent memory. 
   - Qemu side flush uses aio thread pool API's and virtio 
     for asynchronous guest multi request handling. 

   David Hildenbrand CCed also posted a modified version[7] of 
   qemu virtio-pmem code based on updated Qemu memory device API. 

 Virtio-pmem security implications and countermeasures:
 -----------------------------------------------------

 In previous posting of kernel driver, there was discussion [9]
 on possible implications of page cache side channel attacks with 
 virtio pmem. After thorough analysis of details of known side 
 channel attacks, below are the suggestions:

 - Depends entirely on how host backing image file is mapped 
   into guest address space. 

 - virtio-pmem device emulation, by default shared mapping is used
   to map host backing file. It is recommended to use separate
   backing file at host side for every guest. This will prevent
   any possibility of executing common code from multiple guests
   and any chance of inferring guest local data based based on 
   execution time.

 - If backing file is required to be shared among multiple guests 
   it is recommended to don't support host page cache eviction 
   commands from the guest driver. This will avoid any possibility
   of inferring guest local data or host data from another guest. 

 - Proposed device specification [8] for virtio-pmem device with 
   details of possible security implications and suggested 
   countermeasures for device emulation.

 Virtio-pmem errors handling:
 ----------------------------------------
  Checked behaviour of virtio-pmem for below types of errors
  Need suggestions on expected behaviour for handling these errors?

  - Hardware Errors: Uncorrectable recoverable Errors: 
  a] virtio-pmem: 
    - As per current logic if error page belongs to Qemu process, 
      host MCE handler isolates(hwpoison) that page and send SIGBUS. 
      Qemu SIGBUS handler injects exception to KVM guest. 
    - KVM guest then isolates the page and send SIGBUS to guest 
      userspace process which has mapped the page. 
  
  b] Existing implementation for ACPI pmem driver: 
    - Handles such errors with MCE notifier and creates a list 
      of bad blocks. Read/direct access DAX operation return EIO 
      if accessed memory page fall in bad block list.
    - It also starts backgound scrubbing.  
    - Similar functionality can be reused in virtio-pmem with MCE 
      notifier but without scrubbing(no ACPI/ARS)? Need inputs to 
      confirm if this behaviour is ok or needs any change?

Changes from PATCH v4: [1]
- Factor out MAP_SYNC supported functionality to a common helper
				[Dave, Darrick, Jan]
- Comment, indentation and virtqueue_kick failure handle - Yuval Shaia

Changes from PATCH v3: [2]
- Use generic dax_synchronous() helper to check for DAXDEV_SYNC 
  flag - [Dan, Darrick, Jan]
- Add 'is_nvdimm_async' function
- Document page cache side channel attacks implications & 
  countermeasures - [Dave Chinner, Michael]

Changes from PATCH v2: 
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan] 
- Use name 'virtio pmem' in place of 'fake dax' 

Changes from PATCH v1: 
- 0-day build test for build dependency on libnvdimm 

 Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem  
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Remove BSD license text with SPDX license text

- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow

Changes from RFC v3
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush- Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak, correct sizeof req- Stefan
- Move declaration to virtio_pmem.c

Changes from RFC v2:
- Add flush function in the nd_region in place of switching
  on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
  for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan

Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent 
  memory and other operations instead of creating an entirely 
  new block driver.
- Use VIRTIO driver to register memory information with 
  nvdimm_bus and create region_type accordingly. 
- Call VIRTIO flush from existing pmem driver.

Pankaj Gupta (6):
   libnvdimm: nd_region flush callback support
   virtio-pmem: Add virtio-pmem guest driver
   libnvdimm: add nd_region buffered dax_dev flag
   dax: check synchronous mapping is supported
   ext4: disable map_sync for virtio pmem
   xfs: disable map_sync for virtio pmem

[1] https://lkml.org/lkml/2019/4/3/394
[2] https://lkml.org/lkml/2019/1/9/471
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html  
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2 
[7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
[8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
[9] https://lkml.org/lkml/2019/1/9/1191

 drivers/acpi/nfit/core.c         |    4 -
 drivers/dax/bus.c                |    2 
 drivers/dax/super.c              |   13 +++-
 drivers/md/dm.c                  |    2 
 drivers/nvdimm/claim.c           |    6 +
 drivers/nvdimm/nd.h              |    1 
 drivers/nvdimm/pmem.c            |   17 +++--
 drivers/nvdimm/region_devs.c     |   45 +++++++++++++-
 drivers/nvdimm/virtio_pmem.c     |   88 +++++++++++++++++++++++++++
 drivers/virtio/Kconfig           |   10 +++
 drivers/virtio/Makefile          |    1 
 drivers/virtio/pmem.c            |  124 +++++++++++++++++++++++++++++++++++++++
 fs/ext4/file.c                   |   11 +--
 fs/xfs/xfs_file.c                |   10 +--
 include/linux/dax.h              |   32 +++++++++-
 include/linux/libnvdimm.h        |    9 ++
 include/linux/virtio_pmem.h      |   60 ++++++++++++++++++
 include/uapi/linux/virtio_ids.h  |    1 
 include/uapi/linux/virtio_pmem.h |   10 +++
 19 files changed, 419 insertions(+), 27 deletions(-)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
@ 2019-04-10  4:08 ` Pankaj Gupta
  2019-04-11 14:51   ` Dan Williams
  2019-04-10  4:08 ` [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.

This also handles asynchronous flush requests from the block layer
by creating a child bio and chaining it with parent bio.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/acpi/nfit/core.c     |  4 ++--
 drivers/nvdimm/claim.c       |  6 ++++--
 drivers/nvdimm/nd.h          |  1 +
 drivers/nvdimm/pmem.c        | 14 ++++++++-----
 drivers/nvdimm/region_devs.c | 38 ++++++++++++++++++++++++++++++++++--
 include/linux/libnvdimm.h    |  8 +++++++-
 6 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 5a389a4f4f65..567017a2190e 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 		offset = to_interleave_offset(offset, mmio);
 
 	writeq(cmd, mmio->addr.base + offset);
-	nvdimm_flush(nfit_blk->nd_region);
+	nvdimm_flush(nfit_blk->nd_region, NULL, false);
 
 	if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
 		readq(mmio->addr.base + offset);
@@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 	}
 
 	if (rw)
-		nvdimm_flush(nfit_blk->nd_region);
+		nvdimm_flush(nfit_blk->nd_region, NULL, false);
 
 	rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
 	return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf469c7..a1dfa066786b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
 	sector_t sector = offset >> 9;
-	int rc = 0;
+	int rc = 0, ret = 0;
 
 	if (unlikely(!size))
 		return 0;
@@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
 	}
 
 	memcpy_flushcache(nsio->addr + offset, buf, size);
-	nvdimm_flush(to_nd_region(ndns->dev.parent));
+	ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL, false);
+	if (ret)
+		rc = ret;
 
 	return rc;
 }
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index a5ac3b240293..916cd6c5451a 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -159,6 +159,7 @@ struct nd_region {
 	struct badblocks bb;
 	struct nd_interleave_set *nd_set;
 	struct nd_percpu_lane __percpu *lane;
+	int (*flush)(struct nd_region *nd_region);
 	struct nd_mapping mapping[0];
 };
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..5a5b3ea4d073 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 
 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 {
+	int ret = 0;
 	blk_status_t rc = 0;
 	bool do_acct;
 	unsigned long start;
@@ -201,7 +202,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct nd_region *nd_region = to_region(pmem);
 
 	if (bio->bi_opf & REQ_PREFLUSH)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio, true);
 
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 		nd_iostat_end(bio, start);
 
 	if (bio->bi_opf & REQ_FUA)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio, true);
+
+	if (ret)
+		bio->bi_status = errno_to_blk_status(ret);
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
@@ -463,13 +467,13 @@ static int pmem_attach_disk(struct device *dev,
 	disk->bb = &pmem->bb;
 
 	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
 	}
 	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
 	pmem->dax_dev = dax_dev;
-
 	gendev = disk_to_dev(disk);
 	gendev->groups = pmem_attribute_groups;
 
@@ -527,14 +531,14 @@ static int nd_pmem_remove(struct device *dev)
 		sysfs_put(pmem->bb_state);
 		pmem->bb_state = NULL;
 	}
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL, false);
 
 	return 0;
 }
 
 static void nd_pmem_shutdown(struct device *dev)
 {
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL, false);
 }
 
 static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..fb1041ab32a6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
 		return rc;
 	if (!flush)
 		return -EINVAL;
-	nvdimm_flush(nd_region);
+	rc = nvdimm_flush(nd_region, NULL, false);
+	if (rc)
+		return rc;
 
 	return len;
 }
@@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	dev->of_node = ndr_desc->of_node;
 	nd_region->ndr_size = resource_size(ndr_desc->res);
 	nd_region->ndr_start = ndr_desc->res->start;
+	if (ndr_desc->flush)
+		nd_region->flush = ndr_desc->flush;
+	else
+		nd_region->flush = generic_nvdimm_flush;
+
 	nd_device_register(dev);
 
 	return nd_region;
@@ -1125,11 +1132,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 }
 EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
 
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
+{
+	int rc = 0;
+
+	/* Create child bio for asynchronous flush and chain with
+	 * parent bio. Otherwise directly call nd_region flush.
+	 */
+	if (async && bio->bi_iter.bi_sector != -1) {
+
+		struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+		if (!child)
+			return -ENOMEM;
+		bio_copy_dev(child, bio);
+		child->bi_opf = REQ_PREFLUSH;
+		child->bi_iter.bi_sector = -1;
+		bio_chain(child, bio);
+		submit_bio(child);
+	} else {
+		if (nd_region->flush(nd_region))
+			rc = -EIO;
+	}
+
+	return rc;
+}
 /**
  * nvdimm_flush - flush any posted write queues between the cpu and pmem media
  * @nd_region: blk or interleaved pmem region
  */
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
 {
 	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i, idx;
@@ -1153,6 +1185,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 		if (ndrd_get_flush_wpq(ndrd, i, 0))
 			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
 	wmb();
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(nvdimm_flush);
 
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index feb342d026f2..d9d2ab8a6e64 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -65,6 +65,9 @@ enum {
 	 */
 	ND_REGION_PERSIST_MEMCTRL = 2,
 
+	/* Platform provides asynchronous flush mechanism */
+	ND_REGION_ASYNC = 3,
+
 	/* mark newly adjusted resources as requiring a label update */
 	DPA_RESOURCE_ADJUSTED = 1 << 0,
 };
@@ -121,6 +124,7 @@ struct nd_mapping_desc {
 	int position;
 };
 
+struct nd_region;
 struct nd_region_desc {
 	struct resource *res;
 	struct nd_mapping_desc *mapping;
@@ -133,6 +137,7 @@ struct nd_region_desc {
 	int target_node;
 	unsigned long flags;
 	struct device_node *of_node;
+	int (*flush)(struct nd_region *nd_region);
 };
 
 struct device;
@@ -260,7 +265,8 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
 unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
 void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
 u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
+int generic_nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
 int nvdimm_has_cache(struct nd_region *nd_region);
 int nvdimm_in_overwrite(struct nvdimm *nvdimm);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
  2019-04-10  4:08 ` [PATCH v5 1/6] libnvdimm: nd_region flush callback support Pankaj Gupta
@ 2019-04-10  4:08 ` Pankaj Gupta
  2019-04-10 12:24   ` Cornelia Huck
                     ` (2 more replies)
  2019-04-10  4:08 ` [PATCH v5 3/6] libnvdimm: add dax_dev sync flag Pankaj Gupta
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

This patch adds virtio-pmem driver for KVM guest.

Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.

This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
 drivers/virtio/Kconfig           |  10 +++
 drivers/virtio/Makefile          |   1 +
 drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
 include/linux/virtio_pmem.h      |  60 +++++++++++++++
 include/uapi/linux/virtio_ids.h  |   1 +
 include/uapi/linux/virtio_pmem.h |  10 +++
 7 files changed, 294 insertions(+)
 create mode 100644 drivers/nvdimm/virtio_pmem.c
 create mode 100644 drivers/virtio/pmem.c
 create mode 100644 include/linux/virtio_pmem.h
 create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 000000000000..01044cc2f3a3
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+	unsigned int len;
+	unsigned long flags;
+	struct virtio_pmem_request *req, *req_buf;
+	struct virtio_pmem *vpmem = vq->vdev->priv;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+		req->done = true;
+		wake_up(&req->host_acked);
+
+		if (!list_empty(&vpmem->req_list)) {
+			req_buf = list_first_entry(&vpmem->req_list,
+					struct virtio_pmem_request, list);
+			list_del(&vpmem->req_list);
+			req_buf->wq_buf_avail = true;
+			wake_up(&req_buf->wq_buf);
+		}
+	}
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+	int err;
+	unsigned long flags;
+	struct scatterlist *sgs[2], sg, ret;
+	struct virtio_device *vdev = nd_region->provider_data;
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct virtio_pmem_request *req;
+
+	might_sleep();
+	req = kmalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->done = req->wq_buf_avail = false;
+	strcpy(req->name, "FLUSH");
+	init_waitqueue_head(&req->host_acked);
+	init_waitqueue_head(&req->wq_buf);
+	sg_init_one(&sg, req->name, strlen(req->name));
+	sgs[0] = &sg;
+	sg_init_one(&ret, &req->ret, sizeof(req->ret));
+	sgs[1] = &ret;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+	if (err) {
+		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+		list_add_tail(&vpmem->req_list, &req->list);
+		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+		/* When host has read buffer, this completes via host_ack */
+		wait_event(req->wq_buf, req->wq_buf_avail);
+		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	}
+	err = virtqueue_kick(vpmem->req_vq);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	if (!err) {
+		err = -EIO;
+		goto ret;
+	}
+	/* When host has read buffer, this completes via host_ack */
+	wait_event(req->host_acked, req->done);
+	err = req->ret;
+ret:
+	kfree(req);
+	return err;
+};
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 35897649c24f..9f634a2ed638 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
 
 	  If unsure, say Y.
 
+config VIRTIO_PMEM
+	tristate "Support for virtio pmem driver"
+	depends on VIRTIO
+	depends on LIBNVDIMM
+	help
+	This driver provides support for virtio based flushing interface
+	for persistent memory range.
+
+	If unsure, say M.
+
 config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5dcf46..143ce91eabe9 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
new file mode 100644
index 000000000000..cc9de9589d56
--- /dev/null
+++ b/drivers/virtio/pmem.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */
+#include <linux/virtio_pmem.h>
+#include <../../drivers/nvdimm/nd.h>
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+	struct virtqueue *vq;
+
+	/* single vq */
+	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+				host_ack, "flush_queue");
+	if (IS_ERR(vq))
+		return PTR_ERR(vq);
+
+	spin_lock_init(&vpmem->pmem_lock);
+	INIT_LIST_HEAD(&vpmem->req_list);
+
+	return 0;
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+	int err = 0;
+	struct resource res;
+	struct virtio_pmem *vpmem;
+	struct nvdimm_bus *nvdimm_bus;
+	struct nd_region_desc ndr_desc = {};
+	int nid = dev_to_node(&vdev->dev);
+	struct nd_region *nd_region;
+
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+					GFP_KERNEL);
+	if (!vpmem) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	vpmem->vdev = vdev;
+	err = init_vq(vpmem);
+	if (err)
+		goto out_err;
+
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			start, &vpmem->start);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			size, &vpmem->size);
+
+	res.start = vpmem->start;
+	res.end   = vpmem->start + vpmem->size-1;
+	vpmem->nd_desc.provider_name = "virtio-pmem";
+	vpmem->nd_desc.module = THIS_MODULE;
+
+	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+						&vpmem->nd_desc);
+	if (!nvdimm_bus)
+		goto out_vq;
+
+	dev_set_drvdata(&vdev->dev, nvdimm_bus);
+
+	ndr_desc.res = &res;
+	ndr_desc.numa_node = nid;
+	ndr_desc.flush = virtio_pmem_flush;
+	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+	nd_region->provider_data =  dev_to_virtio
+					(nd_region->dev.parent->parent);
+
+	if (!nd_region)
+		goto out_nd;
+
+	return 0;
+out_nd:
+	err = -ENXIO;
+	nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+	vdev->config->del_vqs(vdev);
+out_err:
+	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+	return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	vdev->config->del_vqs(vdev);
+	vdev->config->reset(vdev);
+	kfree(vpmem);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtio_pmem_probe,
+	.remove			= virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
new file mode 100644
index 000000000000..224f9d934be6
--- /dev/null
+++ b/include/linux/virtio_pmem.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * virtio_pmem.h: virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ **/
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/virtio_ids.h>
+#include <linux/module.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+#include <linux/spinlock.h>
+
+struct virtio_pmem_request {
+	/* Host return status corresponding to flush request */
+	int ret;
+
+	/* command name*/
+	char name[16];
+
+	/* Wait queue to process deferred work after ack from host */
+	wait_queue_head_t host_acked;
+	bool done;
+
+	/* Wait queue to process deferred work after virt queue buffer avail */
+	wait_queue_head_t wq_buf;
+	bool wq_buf_avail;
+	struct list_head list;
+};
+
+struct virtio_pmem {
+	struct virtio_device *vdev;
+
+	/* Virtio pmem request queue */
+	struct virtqueue *req_vq;
+
+	/* nvdimm bus registers virtio pmem device */
+	struct nvdimm_bus *nvdimm_bus;
+	struct nvdimm_bus_descriptor nd_desc;
+
+	/* List to store deferred work if virtqueue is full */
+	struct list_head req_list;
+
+	/* Synchronize virtqueue data */
+	spinlock_t pmem_lock;
+
+	/* Memory region information */
+	uint64_t start;
+	uint64_t size;
+};
+
+void host_ack(struct virtqueue *vq);
+int virtio_pmem_flush(struct nd_region *nd_region);
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..346389565ac1 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         25 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 000000000000..fa3f7d52717a
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+	__le64 start;
+	__le64 size;
+};
+#endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v5 3/6] libnvdimm: add dax_dev sync flag
  2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
  2019-04-10  4:08 ` [PATCH v5 1/6] libnvdimm: nd_region flush callback support Pankaj Gupta
  2019-04-10  4:08 ` [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
@ 2019-04-10  4:08 ` Pankaj Gupta
  2019-04-10  8:28   ` Jan Kara
  2019-04-11 14:56   ` Dan Williams
  2019-04-10  4:08 ` [PATCH v5 4/6] dax: check synchronous mapping is supported Pankaj Gupta
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/dax/bus.c            |  2 +-
 drivers/dax/super.c          | 13 ++++++++++++-
 drivers/md/dm.c              |  2 +-
 drivers/nvdimm/pmem.c        |  3 ++-
 drivers/nvdimm/region_devs.c |  7 +++++++
 include/linux/dax.h          |  9 +++++++--
 include/linux/libnvdimm.h    |  1 +
 7 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe80219..431bf7d2a7f9 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
 	 * No 'host' or dax_operations since there is no access to this
 	 * device outside of mmap of the resulting character device.
 	 */
-	dax_dev = alloc_dax(dev_dax, NULL, NULL);
+	dax_dev = alloc_dax(dev_dax, NULL, NULL, true);
 	if (!dax_dev)
 		goto err;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0a339b85133e..bd6509308d05 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -186,6 +186,8 @@ enum dax_device_flags {
 	DAXDEV_ALIVE,
 	/* gate whether dax_flush() calls the low level flush routine */
 	DAXDEV_WRITE_CACHE,
+	/* flag to check if device supports synchronous flush */
+	DAXDEV_SYNC,
 };
 
 /**
@@ -354,6 +356,12 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
 
+bool dax_synchronous(struct dax_device *dax_dev)
+{
+	return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_synchronous);
+
 bool dax_alive(struct dax_device *dax_dev)
 {
 	lockdep_assert_held(&dax_srcu);
@@ -511,7 +519,7 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host)
 }
 
 struct dax_device *alloc_dax(void *private, const char *__host,
-		const struct dax_operations *ops)
+		const struct dax_operations *ops, bool sync)
 {
 	struct dax_device *dax_dev;
 	const char *host;
@@ -534,6 +542,9 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 	dax_add_host(dax_dev, host);
 	dax_dev->ops = ops;
 	dax_dev->private = private;
+	if (sync)
+		set_bit(DAXDEV_SYNC, &dax_dev->flags);
+
 	return dax_dev;
 
  err_dev:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 68d24056d0b1..534e12ca6329 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1965,7 +1965,7 @@ static struct mapped_device *alloc_dev(int minor)
 	sprintf(md->disk->disk_name, "dm-%d", minor);
 
 	if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
-		dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+		dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops, true);
 		if (!dax_dev)
 			goto bad;
 	}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5a5b3ea4d073..78f71ba0e7cf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -466,7 +466,8 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops,
+					is_nvdimm_sync(nd_region));
 
 	if (!dax_dev) {
 		put_disk(disk);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fb1041ab32a6..8c7aa047fe2b 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1231,6 +1231,13 @@ int nvdimm_has_cache(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_cache);
 
+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+	return is_nd_pmem(&nd_region->dev) &&
+		!test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
 struct conflict_context {
 	struct nd_region *nd_region;
 	resource_size_t start, size;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..b896706a5ee9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,18 +32,19 @@ extern struct attribute_group dax_attribute_group;
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *dax_get_by_host(const char *host);
 struct dax_device *alloc_dax(void *private, const char *host,
-		const struct dax_operations *ops);
+		const struct dax_operations *ops, bool sync);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
+bool dax_synchronous(struct dax_device *dax_dev);
 #else
 static inline struct dax_device *dax_get_by_host(const char *host)
 {
 	return NULL;
 }
 static inline struct dax_device *alloc_dax(void *private, const char *host,
-		const struct dax_operations *ops)
+		const struct dax_operations *ops, bool sync)
 {
 	/*
 	 * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
@@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
 {
 	return false;
 }
+static inline bool dax_synchronous(struct dax_device *dax_dev)
+{
+	return true;
+}
 #endif
 
 struct writeback_control;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index d9d2ab8a6e64..9a8aea370cbc 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -270,6 +270,7 @@ int generic_nvdimm_flush(struct nd_region *nd_region);
 int nvdimm_has_flush(struct nd_region *nd_region);
 int nvdimm_has_cache(struct nd_region *nd_region);
 int nvdimm_in_overwrite(struct nvdimm *nvdimm);
+bool is_nvdimm_sync(struct nd_region *nd_region);
 
 static inline int nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
 		unsigned int buf_len, int *cmd_rc)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v5 4/6] dax: check synchronous mapping is supported
  2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
                   ` (2 preceding siblings ...)
  2019-04-10  4:08 ` [PATCH v5 3/6] libnvdimm: add dax_dev sync flag Pankaj Gupta
@ 2019-04-10  4:08 ` Pankaj Gupta
  2019-04-10  8:25   ` Jan Kara
  2019-04-10  4:08 ` [PATCH v5 5/6] ext4: disable map_sync for async flush Pankaj Gupta
  2019-04-10  4:08 ` [PATCH v5 6/6] xfs: " Pankaj Gupta
  5 siblings, 1 reply; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

This patch introduces 'daxdev_mapping_supported' helper
which checks if 'MAP_SYNC' is supported with filesystem
mapping. It also checks if corresponding dax_device is
synchronous. Virtio pmem device is asynchronous and
does not not support VM_SYNC. 

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 include/linux/dax.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b896706a5ee9..4a2a60ffec86 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,24 @@ void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
+
+/*
+ * Callers check if synchronous mapping is enabled for DAX file
+ * and attached dax device is also synchronous.
+ *
+ * dax_synchronous function verifies if dax device is synchronous.
+ * Currently, only virtio pmem device supports asynchronous device
+ * flush.
+ */
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+				struct dax_device *dax_dev)
+{
+	if (!(vma->vm_flags & VM_SYNC))
+		return true;
+	if (!IS_DAX(file_inode(vma->vm_file)))
+		return false;
+	return dax_synchronous(dax_dev);
+}
 #else
 static inline struct dax_device *dax_get_by_host(const char *host)
 {
@@ -69,6 +87,11 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
 {
 	return true;
 }
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+				struct dax_device *dax_dev)
+{
+	return true;
+}
 #endif
 
 struct writeback_control;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v5 5/6] ext4: disable map_sync for async flush
  2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
                   ` (3 preceding siblings ...)
  2019-04-10  4:08 ` [PATCH v5 4/6] dax: check synchronous mapping is supported Pankaj Gupta
@ 2019-04-10  4:08 ` Pankaj Gupta
  2019-04-10  4:08 ` [PATCH v5 6/6] xfs: " Pankaj Gupta
  5 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides 
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 fs/ext4/file.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 69d65d49837b..4b2ccaf1932e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -360,15 +360,16 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file->f_mapping->host;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct dax_device *dax_dev = sbi->s_daxdev;
 
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+	if (unlikely(ext4_forced_shutdown(sbi)))
 		return -EIO;
 
-	/*
-	 * We don't support synchronous mappings for non-DAX files. At least
-	 * until someone comes with a sensible use case.
+	/* We don't support synchronous mappings for non-DAX files and
+	 * for DAX files if underneath dax_device is not synchronous.
 	 */
-	if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
+	if (!daxdev_mapping_supported(vma, dax_dev))
 		return -EOPNOTSUPP;
 
 	file_accessed(file);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v5 6/6] xfs: disable map_sync for async flush
  2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
                   ` (4 preceding siblings ...)
  2019-04-10  4:08 ` [PATCH v5 5/6] ext4: disable map_sync for async flush Pankaj Gupta
@ 2019-04-10  4:08 ` Pankaj Gupta
  5 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  4:08 UTC (permalink / raw)
  To: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs
  Cc: dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, cohuck, xiaoguangrong.eric,
	pbonzini, kilobyte, yuval.shaia, pagupta

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 fs/xfs/xfs_file.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1f2e2845eb76..0e59be018511 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1196,11 +1196,13 @@ xfs_file_mmap(
 	struct file	*filp,
 	struct vm_area_struct *vma)
 {
-	/*
-	 * We don't support synchronous mappings for non-DAX files. At least
-	 * until someone comes with a sensible use case.
+	struct dax_device *dax_dev = xfs_find_daxdev_for_inode
+						(file_inode(filp));
+
+	/* We don't support synchronous mappings for non-DAX files and
+	 * for DAX files if underneath dax_device is not synchronous.
 	 */
-	if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+	if (!daxdev_mapping_supported(vma, dax_dev))
 		return -EOPNOTSUPP;
 
 	file_accessed(filp);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 4/6] dax: check synchronous mapping is supported
  2019-04-10  4:08 ` [PATCH v5 4/6] dax: check synchronous mapping is supported Pankaj Gupta
@ 2019-04-10  8:25   ` Jan Kara
  2019-04-10  8:31     ` Pankaj Gupta
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2019-04-10  8:25 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan.j.williams,
	zwisler, vishal.l.verma, dave.jiang, mst, jasowang, willy, rjw,
	hch, lenb, jack, tytso, adilger.kernel, darrick.wong,
	lcapitulino, kwolf, imammedo, jmoyer, nilal, riel, stefanha,
	aarcange, david, david, cohuck, xiaoguangrong.eric, pbonzini,
	kilobyte, yuval.shaia

On Wed 10-04-19 09:38:24, Pankaj Gupta wrote:
> This patch introduces 'daxdev_mapping_supported' helper
> which checks if 'MAP_SYNC' is supported with filesystem
> mapping. It also checks if corresponding dax_device is
> synchronous. Virtio pmem device is asynchronous and
> does not not support VM_SYNC. 
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  include/linux/dax.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b896706a5ee9..4a2a60ffec86 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -38,6 +38,24 @@ void kill_dax(struct dax_device *dax_dev);
>  void dax_write_cache(struct dax_device *dax_dev, bool wc);
>  bool dax_write_cache_enabled(struct dax_device *dax_dev);
>  bool dax_synchronous(struct dax_device *dax_dev);
> +
> +/*
> + * Callers check if synchronous mapping is enabled for DAX file
> + * and attached dax device is also synchronous.
> + *
> + * dax_synchronous function verifies if dax device is synchronous.
> + * Currently, only virtio pmem device supports asynchronous device
> + * flush.
> + */

Thanks for the patch! I'd restructure this comment like:

/*
 * Check if given mapping is supported by the file / underlying device.
 */
> +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> +				struct dax_device *dax_dev)
> +{

	/* Everyone supports non-sync mappings */
> +	if (!(vma->vm_flags & VM_SYNC))
> +		return true;
	/* Sync mappings are supported only for files using DAX */
> +	if (!IS_DAX(file_inode(vma->vm_file)))
> +		return false;
	/* Underlying device must support persisting through CPU instructions */
> +	return dax_synchronous(dax_dev);
> +}
>  #else
>  static inline struct dax_device *dax_get_by_host(const char *host)
>  {
> @@ -69,6 +87,11 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
>  {
>  	return true;
>  }
> +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> +				struct dax_device *dax_dev)
> +{
> +	return true;

This looks wrong. Shouldn't it rather be:

	return !(vma->flags & VM_SYNC);

?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 3/6] libnvdimm: add dax_dev sync flag
  2019-04-10  4:08 ` [PATCH v5 3/6] libnvdimm: add dax_dev sync flag Pankaj Gupta
@ 2019-04-10  8:28   ` Jan Kara
  2019-04-10  8:38     ` Pankaj Gupta
  2019-04-11 14:56   ` Dan Williams
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2019-04-10  8:28 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan.j.williams,
	zwisler, vishal.l.verma, dave.jiang, mst, jasowang, willy, rjw,
	hch, lenb, jack, tytso, adilger.kernel, darrick.wong,
	lcapitulino, kwolf, imammedo, jmoyer, nilal, riel, stefanha,
	aarcange, david, david, cohuck, xiaoguangrong.eric, pbonzini,
	kilobyte, yuval.shaia

On Wed 10-04-19 09:38:23, Pankaj Gupta wrote:
> @@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
>  {
>  	return false;
>  }
> +static inline bool dax_synchronous(struct dax_device *dax_dev)
> +{
> +	return true;
> +}

Is there a need to define dax_synchronous() for !CONFIG_DAX? Because that
property of dax device is pretty much undefined and I don't see anything
needing to call it for !CONFIG_DAX...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 4/6] dax: check synchronous mapping is supported
  2019-04-10  8:25   ` Jan Kara
@ 2019-04-10  8:31     ` Pankaj Gupta
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  8:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan j williams,
	zwisler, vishal l verma, dave jiang, mst, jasowang, willy, rjw,
	hch, lenb, tytso, adilger kernel, darrick wong, lcapitulino,
	kwolf, imammedo, jmoyer, nilal, riel, stefanha, aarcange, david,
	david, cohuck, xiaoguangrong eric, pbonzini, kilobyte,
	yuval shaia


> > This patch introduces 'daxdev_mapping_supported' helper
> > which checks if 'MAP_SYNC' is supported with filesystem
> > mapping. It also checks if corresponding dax_device is
> > synchronous. Virtio pmem device is asynchronous and
> > does not not support VM_SYNC.
> > 
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  include/linux/dax.h | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b896706a5ee9..4a2a60ffec86 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -38,6 +38,24 @@ void kill_dax(struct dax_device *dax_dev);
> >  void dax_write_cache(struct dax_device *dax_dev, bool wc);
> >  bool dax_write_cache_enabled(struct dax_device *dax_dev);
> >  bool dax_synchronous(struct dax_device *dax_dev);
> > +
> > +/*
> > + * Callers check if synchronous mapping is enabled for DAX file
> > + * and attached dax device is also synchronous.
> > + *
> > + * dax_synchronous function verifies if dax device is synchronous.
> > + * Currently, only virtio pmem device supports asynchronous device
> > + * flush.
> > + */
> 
> Thanks for the patch! I'd restructure this comment like:
> 
> /*
>  * Check if given mapping is supported by the file / underlying device.
>  */

Sure.

> > +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> > +				struct dax_device *dax_dev)
> > +{
> 
> 	/* Everyone supports non-sync mappings */
> > +	if (!(vma->vm_flags & VM_SYNC))
> > +		return true;
> 	/* Sync mappings are supported only for files using DAX */
> > +	if (!IS_DAX(file_inode(vma->vm_file)))
> > +		return false;
> 	/* Underlying device must support persisting through CPU instructions */
> > +	return dax_synchronous(dax_dev);
> > +}
> >  #else
> >  static inline struct dax_device *dax_get_by_host(const char *host)
> >  {
> > @@ -69,6 +87,11 @@ static inline bool dax_synchronous(struct dax_device
> > *dax_dev)
> >  {
> >  	return true;
> >  }
> > +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> > +				struct dax_device *dax_dev)
> > +{
> > +	return true;
> 
> This looks wrong. Shouldn't it rather be:
> 
> 	return !(vma->flags & VM_SYNC);
> 
> ?

Right. I will correct this.

Thanks,
Pankaj

> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 3/6] libnvdimm: add dax_dev sync flag
  2019-04-10  8:28   ` Jan Kara
@ 2019-04-10  8:38     ` Pankaj Gupta
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10  8:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan j williams,
	zwisler, vishal l verma, dave jiang, mst, jasowang, willy, rjw,
	hch, lenb, tytso, adilger kernel, darrick wong, lcapitulino,
	kwolf, imammedo, jmoyer, nilal, riel, stefanha, aarcange, david,
	david, cohuck, xiaoguangrong eric, pbonzini, kilobyte,
	yuval shaia


> 
> On Wed 10-04-19 09:38:23, Pankaj Gupta wrote:
> > @@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct
> > dax_device *dax_dev)
> >  {
> >  	return false;
> >  }
> > +static inline bool dax_synchronous(struct dax_device *dax_dev)
> > +{
> > +	return true;
> > +}
> 
> Is there a need to define dax_synchronous() for !CONFIG_DAX? Because that
> property of dax device is pretty much undefined and I don't see anything
> needing to call it for !CONFIG_DAX...

You are right. Will remove this.

Thanks for the review.

Best regards,
Pankaj

> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10  4:08 ` [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
@ 2019-04-10 12:24   ` Cornelia Huck
  2019-04-10 14:38     ` Yuval Shaia
  2019-04-10 15:38     ` Pankaj Gupta
  2019-04-10 13:12   ` Michael S. Tsirkin
  2019-04-10 14:41   ` Yuval Shaia
  2 siblings, 2 replies; 38+ messages in thread
From: Cornelia Huck @ 2019-04-10 12:24 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan.j.williams,
	zwisler, vishal.l.verma, dave.jiang, mst, jasowang, willy, rjw,
	hch, lenb, jack, tytso, adilger.kernel, darrick.wong,
	lcapitulino, kwolf, imammedo, jmoyer, nilal, riel, stefanha,
	aarcange, david, david, xiaoguangrong.eric, pbonzini, kilobyte,
	yuval.shaia

On Wed, 10 Apr 2019 09:38:22 +0530
Pankaj Gupta <pagupta@redhat.com> wrote:

> This patch adds virtio-pmem driver for KVM guest.
> 
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
> 
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
> 
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
>  drivers/virtio/Kconfig           |  10 +++
>  drivers/virtio/Makefile          |   1 +
>  drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
>  include/linux/virtio_pmem.h      |  60 +++++++++++++++
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  10 +++
>  7 files changed, 294 insertions(+)
>  create mode 100644 drivers/nvdimm/virtio_pmem.c
>  create mode 100644 drivers/virtio/pmem.c
>  create mode 100644 include/linux/virtio_pmem.h
>  create mode 100644 include/uapi/linux/virtio_pmem.h
> 
(...)
> diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> new file mode 100644
> index 000000000000..cc9de9589d56
> --- /dev/null
> +++ b/drivers/virtio/pmem.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and registers the virtual pmem device
> + * with libnvdimm core.
> + */
> +#include <linux/virtio_pmem.h>
> +#include <../../drivers/nvdimm/nd.h>
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)

IMHO, you don't gain much by splitting off this function...

> +{
> +	struct virtqueue *vq;
> +
> +	/* single vq */
> +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> +				host_ack, "flush_queue");
> +	if (IS_ERR(vq))
> +		return PTR_ERR(vq);

I'm personally not a fan of chained assignments... I think I'd just
drop the 'vq' variable and operate on vpmem->req_vq directly.

> +
> +	spin_lock_init(&vpmem->pmem_lock);
> +	INIT_LIST_HEAD(&vpmem->req_list);
> +
> +	return 0;
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +	int err = 0;
> +	struct resource res;
> +	struct virtio_pmem *vpmem;
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nd_region_desc ndr_desc = {};
> +	int nid = dev_to_node(&vdev->dev);
> +	struct nd_region *nd_region;
> +
> +	if (!vdev->config->get) {
> +		dev_err(&vdev->dev, "%s failure: config disabled\n",

Maybe s/config disabled/config access disabled/ ? That seems to be the
more common message.

> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +					GFP_KERNEL);

Here, the vpmem variable makes sense for convenience, but I'm again not
a fan of the chaining :)

> +	if (!vpmem) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	vpmem->vdev = vdev;
> +	err = init_vq(vpmem);
> +	if (err)
> +		goto out_err;
> +
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			start, &vpmem->start);
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			size, &vpmem->size);
> +
> +	res.start = vpmem->start;
> +	res.end   = vpmem->start + vpmem->size-1;
> +	vpmem->nd_desc.provider_name = "virtio-pmem";
> +	vpmem->nd_desc.module = THIS_MODULE;
> +
> +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> +						&vpmem->nd_desc);

And here :)

> +	if (!nvdimm_bus)
> +		goto out_vq;
> +
> +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> +	ndr_desc.res = &res;
> +	ndr_desc.numa_node = nid;
> +	ndr_desc.flush = virtio_pmem_flush;
> +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +	nd_region->provider_data =  dev_to_virtio
> +					(nd_region->dev.parent->parent);

Isn't it clear that this parent chain will always end up at &vdev->dev?
Maybe simply set ->provider_data to vdev directly? (Does it need to
grab a reference count of the device, BTW?)

> +
> +	if (!nd_region)
> +		goto out_nd;

Probably better to do this check before you access nd_region's
members :)

> +
> +	return 0;
> +out_nd:
> +	err = -ENXIO;
> +	nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> +	vdev->config->del_vqs(vdev);
> +out_err:
> +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> +	return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> +	struct virtio_pmem *vpmem = vdev->priv;
> +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> +	nvdimm_bus_unregister(nvdimm_bus);

I haven't followed this around the nvdimm code, but is the nd_region
you created during probe cleaned up automatically, or would you need to
do something here?

> +	vdev->config->del_vqs(vdev);
> +	vdev->config->reset(vdev);
> +	kfree(vpmem);

You allocated vpmem via devm_kzalloc; isn't it freed automatically on
remove?

> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.probe			= virtio_pmem_probe,
> +	.remove			= virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");

Only looked at this from the general virtio driver angle; seems fine
apart from some easy-to-fix issues and some personal style preference
things.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10  4:08 ` [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
  2019-04-10 12:24   ` Cornelia Huck
@ 2019-04-10 13:12   ` Michael S. Tsirkin
  2019-04-10 14:03     ` [Qemu-devel] " Pankaj Gupta
  2019-04-10 14:41   ` Yuval Shaia
  2 siblings, 1 reply; 38+ messages in thread
From: Michael S. Tsirkin @ 2019-04-10 13:12 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan.j.williams,
	zwisler, vishal.l.verma, dave.jiang, jasowang, willy, rjw, hch,
	lenb, jack, tytso, adilger.kernel, darrick.wong, lcapitulino,
	kwolf, imammedo, jmoyer, nilal, riel, stefanha, aarcange, david,
	david, cohuck, xiaoguangrong.eric, pbonzini, kilobyte,
	yuval.shaia

On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> This patch adds virtio-pmem driver for KVM guest.
> 
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
> 
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
> 
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
>  drivers/virtio/Kconfig           |  10 +++
>  drivers/virtio/Makefile          |   1 +
>  drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
>  include/linux/virtio_pmem.h      |  60 +++++++++++++++
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  10 +++
>  7 files changed, 294 insertions(+)
>  create mode 100644 drivers/nvdimm/virtio_pmem.c
>  create mode 100644 drivers/virtio/pmem.c
>  create mode 100644 include/linux/virtio_pmem.h
>  create mode 100644 include/uapi/linux/virtio_pmem.h
> 
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> new file mode 100644
> index 000000000000..01044cc2f3a3
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio_pmem.h>
> +#include "nd.h"
> +
> + /* The interrupt handler */
> +void host_ack(struct virtqueue *vq)
> +{
> +	unsigned int len;
> +	unsigned long flags;
> +	struct virtio_pmem_request *req, *req_buf;
> +	struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> +		req->done = true;
> +		wake_up(&req->host_acked);
> +
> +		if (!list_empty(&vpmem->req_list)) {
> +			req_buf = list_first_entry(&vpmem->req_list,
> +					struct virtio_pmem_request, list);
> +			list_del(&vpmem->req_list);
> +			req_buf->wq_buf_avail = true;
> +			wake_up(&req_buf->wq_buf);
> +		}
> +	}
> +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(host_ack);
> +
> + /* The request submission function */
> +int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> +	int err;
> +	unsigned long flags;
> +	struct scatterlist *sgs[2], sg, ret;
> +	struct virtio_device *vdev = nd_region->provider_data;
> +	struct virtio_pmem *vpmem = vdev->priv;
> +	struct virtio_pmem_request *req;
> +
> +	might_sleep();
> +	req = kmalloc(sizeof(*req), GFP_KERNEL);
> +	if (!req)
> +		return -ENOMEM;
> +
> +	req->done = req->wq_buf_avail = false;
> +	strcpy(req->name, "FLUSH");
> +	init_waitqueue_head(&req->host_acked);
> +	init_waitqueue_head(&req->wq_buf);
> +	sg_init_one(&sg, req->name, strlen(req->name));
> +	sgs[0] = &sg;
> +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> +	sgs[1] = &ret;
> +
> +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> +	if (err) {
> +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> +		list_add_tail(&vpmem->req_list, &req->list);
> +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +		/* When host has read buffer, this completes via host_ack */
> +		wait_event(req->wq_buf, req->wq_buf_avail);
> +		spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	}
> +	err = virtqueue_kick(vpmem->req_vq);
> +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +	if (!err) {
> +		err = -EIO;
> +		goto ret;
> +	}
> +	/* When host has read buffer, this completes via host_ack */
> +	wait_event(req->host_acked, req->done);
> +	err = req->ret;
> +ret:
> +	kfree(req);
> +	return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..9f634a2ed638 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
>  
>  	  If unsure, say Y.
>  
> +config VIRTIO_PMEM
> +	tristate "Support for virtio pmem driver"
> +	depends on VIRTIO
> +	depends on LIBNVDIMM
> +	help
> +	This driver provides support for virtio based flushing interface
> +	for persistent memory range.
> +
> +	If unsure, say M.
> +
>  config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5dcf46..143ce91eabe9 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> new file mode 100644
> index 000000000000..cc9de9589d56
> --- /dev/null
> +++ b/drivers/virtio/pmem.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and registers the virtual pmem device
> + * with libnvdimm core.
> + */
> +#include <linux/virtio_pmem.h>
> +#include <../../drivers/nvdimm/nd.h>
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> +	struct virtqueue *vq;
> +
> +	/* single vq */
> +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> +				host_ack, "flush_queue");
> +	if (IS_ERR(vq))
> +		return PTR_ERR(vq);
> +
> +	spin_lock_init(&vpmem->pmem_lock);
> +	INIT_LIST_HEAD(&vpmem->req_list);
> +
> +	return 0;
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +	int err = 0;
> +	struct resource res;
> +	struct virtio_pmem *vpmem;
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nd_region_desc ndr_desc = {};
> +	int nid = dev_to_node(&vdev->dev);
> +	struct nd_region *nd_region;
> +
> +	if (!vdev->config->get) {
> +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +					GFP_KERNEL);
> +	if (!vpmem) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	vpmem->vdev = vdev;
> +	err = init_vq(vpmem);
> +	if (err)
> +		goto out_err;
> +
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			start, &vpmem->start);
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			size, &vpmem->size);
> +
> +	res.start = vpmem->start;
> +	res.end   = vpmem->start + vpmem->size-1;
> +	vpmem->nd_desc.provider_name = "virtio-pmem";
> +	vpmem->nd_desc.module = THIS_MODULE;
> +
> +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> +						&vpmem->nd_desc);
> +	if (!nvdimm_bus)
> +		goto out_vq;
> +
> +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> +	ndr_desc.res = &res;
> +	ndr_desc.numa_node = nid;
> +	ndr_desc.flush = virtio_pmem_flush;
> +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +	nd_region->provider_data =  dev_to_virtio
> +					(nd_region->dev.parent->parent);
> +
> +	if (!nd_region)
> +		goto out_nd;
> +
> +	return 0;
> +out_nd:
> +	err = -ENXIO;
> +	nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> +	vdev->config->del_vqs(vdev);
> +out_err:
> +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> +	return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> +	struct virtio_pmem *vpmem = vdev->priv;
> +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> +	nvdimm_bus_unregister(nvdimm_bus);
> +	vdev->config->del_vqs(vdev);
> +	vdev->config->reset(vdev);
> +	kfree(vpmem);
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.probe			= virtio_pmem_probe,
> +	.remove			= virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..224f9d934be6
> --- /dev/null
> +++ b/include/linux/virtio_pmem.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * virtio_pmem.h: virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + **/
> +
> +#ifndef _LINUX_VIRTIO_PMEM_H
> +#define _LINUX_VIRTIO_PMEM_H
> +
> +#include <linux/virtio_ids.h>
> +#include <linux/module.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/spinlock.h>
> +
> +struct virtio_pmem_request {
> +	/* Host return status corresponding to flush request */
> +	int ret;
> +
> +	/* command name*/
> +	char name[16];
> +
> +	/* Wait queue to process deferred work after ack from host */
> +	wait_queue_head_t host_acked;
> +	bool done;
> +
> +	/* Wait queue to process deferred work after virt queue buffer avail */
> +	wait_queue_head_t wq_buf;
> +	bool wq_buf_avail;
> +	struct list_head list;
> +};
> +
> +struct virtio_pmem {
> +	struct virtio_device *vdev;
> +
> +	/* Virtio pmem request queue */
> +	struct virtqueue *req_vq;
> +
> +	/* nvdimm bus registers virtio pmem device */
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nvdimm_bus_descriptor nd_desc;
> +
> +	/* List to store deferred work if virtqueue is full */
> +	struct list_head req_list;
> +
> +	/* Synchronize virtqueue data */
> +	spinlock_t pmem_lock;
> +
> +	/* Memory region information */
> +	uint64_t start;
> +	uint64_t size;
> +};
> +
> +void host_ack(struct virtqueue *vq);
> +int virtio_pmem_flush(struct nd_region *nd_region);
> +#endif
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2d4f4d..346389565ac1 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
>  #define VIRTIO_ID_INPUT        18 /* virtio input */
>  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
>  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
>  
>  #endif /* _LINUX_VIRTIO_IDS_H */

Didn't Paolo point out someone is using 25 for audio?

Please, reserve an ID with the Virtio TC before using it.

> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..fa3f7d52717a
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> +	__le64 start;
> +	__le64 size;
> +};
> +#endif
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 13:12   ` Michael S. Tsirkin
@ 2019-04-10 14:03     ` Pankaj Gupta
  2019-04-10 14:31       ` Cornelia Huck
  0 siblings, 1 reply; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10 14:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: cohuck, jack, kvm, david, jasowang, david, qemu-devel,
	virtualization, adilger kernel, zwisler, aarcange, dave jiang,
	linux-nvdimm, vishal l verma, willy, hch, linux-acpi, jmoyer,
	linux-ext4, lenb, kilobyte, riel, yuval shaia, stefanha,
	pbonzini, dan j williams, lcapitulino, kwolf, nilal, tytso,
	xiaoguangrong eric, darrick wong, rjw, linux-kernel, linux-xfs,
	linux-fsdevel, imammedo


> 
> On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> > This patch adds virtio-pmem driver for KVM guest.
> > 
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> > 
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
> >  drivers/virtio/Kconfig           |  10 +++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
> >  include/linux/virtio_pmem.h      |  60 +++++++++++++++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  10 +++
> >  7 files changed, 294 insertions(+)
> >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> >  create mode 100644 drivers/virtio/pmem.c
> >  create mode 100644 include/linux/virtio_pmem.h
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..01044cc2f3a3
> > --- /dev/null
> > +++ b/drivers/nvdimm/virtio_pmem.c
> > @@ -0,0 +1,88 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > + /* The interrupt handler */
> > +void host_ack(struct virtqueue *vq)
> > +{
> > +	unsigned int len;
> > +	unsigned long flags;
> > +	struct virtio_pmem_request *req, *req_buf;
> > +	struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > +		req->done = true;
> > +		wake_up(&req->host_acked);
> > +
> > +		if (!list_empty(&vpmem->req_list)) {
> > +			req_buf = list_first_entry(&vpmem->req_list,
> > +					struct virtio_pmem_request, list);
> > +			list_del(&vpmem->req_list);
> > +			req_buf->wq_buf_avail = true;
> > +			wake_up(&req_buf->wq_buf);
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(host_ack);
> > +
> > + /* The request submission function */
> > +int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > +	int err;
> > +	unsigned long flags;
> > +	struct scatterlist *sgs[2], sg, ret;
> > +	struct virtio_device *vdev = nd_region->provider_data;
> > +	struct virtio_pmem *vpmem = vdev->priv;
> > +	struct virtio_pmem_request *req;
> > +
> > +	might_sleep();
> > +	req = kmalloc(sizeof(*req), GFP_KERNEL);
> > +	if (!req)
> > +		return -ENOMEM;
> > +
> > +	req->done = req->wq_buf_avail = false;
> > +	strcpy(req->name, "FLUSH");
> > +	init_waitqueue_head(&req->host_acked);
> > +	init_waitqueue_head(&req->wq_buf);
> > +	sg_init_one(&sg, req->name, strlen(req->name));
> > +	sgs[0] = &sg;
> > +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > +	sgs[1] = &ret;
> > +
> > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > +	if (err) {
> > +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > +
> > +		list_add_tail(&vpmem->req_list, &req->list);
> > +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +		/* When host has read buffer, this completes via host_ack */
> > +		wait_event(req->wq_buf, req->wq_buf_avail);
> > +		spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +	}
> > +	err = virtqueue_kick(vpmem->req_vq);
> > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +	if (!err) {
> > +		err = -EIO;
> > +		goto ret;
> > +	}
> > +	/* When host has read buffer, this completes via host_ack */
> > +	wait_event(req->host_acked, req->done);
> > +	err = req->ret;
> > +ret:
> > +	kfree(req);
> > +	return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 35897649c24f..9f634a2ed638 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> >  
> >  	  If unsure, say Y.
> >  
> > +config VIRTIO_PMEM
> > +	tristate "Support for virtio pmem driver"
> > +	depends on VIRTIO
> > +	depends on LIBNVDIMM
> > +	help
> > +	This driver provides support for virtio based flushing interface
> > +	for persistent memory range.
> > +
> > +	If unsure, say M.
> > +
> >  config VIRTIO_BALLOON
> >  	tristate "Virtio balloon driver"
> >  	depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5dcf46..143ce91eabe9 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +	{ 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > +	struct virtqueue *vq;
> > +
> > +	/* single vq */
> > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +				host_ack, "flush_queue");
> > +	if (IS_ERR(vq))
> > +		return PTR_ERR(vq);
> > +
> > +	spin_lock_init(&vpmem->pmem_lock);
> > +	INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > +	return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +	int err = 0;
> > +	struct resource res;
> > +	struct virtio_pmem *vpmem;
> > +	struct nvdimm_bus *nvdimm_bus;
> > +	struct nd_region_desc ndr_desc = {};
> > +	int nid = dev_to_node(&vdev->dev);
> > +	struct nd_region *nd_region;
> > +
> > +	if (!vdev->config->get) {
> > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> > +			__func__);
> > +		return -EINVAL;
> > +	}
> > +
> > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +					GFP_KERNEL);
> > +	if (!vpmem) {
> > +		err = -ENOMEM;
> > +		goto out_err;
> > +	}
> > +
> > +	vpmem->vdev = vdev;
> > +	err = init_vq(vpmem);
> > +	if (err)
> > +		goto out_err;
> > +
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			start, &vpmem->start);
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			size, &vpmem->size);
> > +
> > +	res.start = vpmem->start;
> > +	res.end   = vpmem->start + vpmem->size-1;
> > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > +	vpmem->nd_desc.module = THIS_MODULE;
> > +
> > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > +						&vpmem->nd_desc);
> > +	if (!nvdimm_bus)
> > +		goto out_vq;
> > +
> > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +
> > +	ndr_desc.res = &res;
> > +	ndr_desc.numa_node = nid;
> > +	ndr_desc.flush = virtio_pmem_flush;
> > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +	nd_region->provider_data =  dev_to_virtio
> > +					(nd_region->dev.parent->parent);
> > +
> > +	if (!nd_region)
> > +		goto out_nd;
> > +
> > +	return 0;
> > +out_nd:
> > +	err = -ENXIO;
> > +	nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > +	vdev->config->del_vqs(vdev);
> > +out_err:
> > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > +	return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > +	struct virtio_pmem *vpmem = vdev->priv;
> > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > +	nvdimm_bus_unregister(nvdimm_bus);
> > +	vdev->config->del_vqs(vdev);
> > +	vdev->config->reset(vdev);
> > +	kfree(vpmem);
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > +	.driver.name		= KBUILD_MODNAME,
> > +	.driver.owner		= THIS_MODULE,
> > +	.id_table		= id_table,
> > +	.probe			= virtio_pmem_probe,
> > +	.remove			= virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..224f9d934be6
> > --- /dev/null
> > +++ b/include/linux/virtio_pmem.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * virtio_pmem.h: virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + **/
> > +
> > +#ifndef _LINUX_VIRTIO_PMEM_H
> > +#define _LINUX_VIRTIO_PMEM_H
> > +
> > +#include <linux/virtio_ids.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/spinlock.h>
> > +
> > +struct virtio_pmem_request {
> > +	/* Host return status corresponding to flush request */
> > +	int ret;
> > +
> > +	/* command name*/
> > +	char name[16];
> > +
> > +	/* Wait queue to process deferred work after ack from host */
> > +	wait_queue_head_t host_acked;
> > +	bool done;
> > +
> > +	/* Wait queue to process deferred work after virt queue buffer avail */
> > +	wait_queue_head_t wq_buf;
> > +	bool wq_buf_avail;
> > +	struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > +	struct virtio_device *vdev;
> > +
> > +	/* Virtio pmem request queue */
> > +	struct virtqueue *req_vq;
> > +
> > +	/* nvdimm bus registers virtio pmem device */
> > +	struct nvdimm_bus *nvdimm_bus;
> > +	struct nvdimm_bus_descriptor nd_desc;
> > +
> > +	/* List to store deferred work if virtqueue is full */
> > +	struct list_head req_list;
> > +
> > +	/* Synchronize virtqueue data */
> > +	spinlock_t pmem_lock;
> > +
> > +	/* Memory region information */
> > +	uint64_t start;
> > +	uint64_t size;
> > +};
> > +
> > +void host_ack(struct virtqueue *vq);
> > +int virtio_pmem_flush(struct nd_region *nd_region);
> > +#endif
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2d4f4d..346389565ac1 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> >  
> >  #endif /* _LINUX_VIRTIO_IDS_H */
> 
> Didn't Paolo point out someone is using 25 for audio?


Sorry! I did not notice this.

> 
> Please, reserve an ID with the Virtio TC before using it.

I thought I requested[1] ID 25.

[1] https://github.com/oasis-tcs/virtio-spec/issues/38
[2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html

In that case I will send request for next available ID i.e 26?

Thanks,
Pankaj


> 
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..fa3f7d52717a
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > +	__le64 start;
> > +	__le64 size;
> > +};
> > +#endif
> > --
> > 2.20.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 14:03     ` [Qemu-devel] " Pankaj Gupta
@ 2019-04-10 14:31       ` Cornelia Huck
  2019-04-10 16:46         ` Michael S. Tsirkin
  0 siblings, 1 reply; 38+ messages in thread
From: Cornelia Huck @ 2019-04-10 14:31 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Michael S. Tsirkin, jack, kvm, david, jasowang, david,
	qemu-devel, virtualization, adilger kernel, zwisler, aarcange,
	dave jiang, linux-nvdimm, vishal l verma, willy, hch, linux-acpi,
	jmoyer, linux-ext4, lenb, kilobyte, riel, yuval shaia, stefanha,
	pbonzini, dan j williams, lcapitulino, kwolf, nilal, tytso,
	xiaoguangrong eric, darrick wong, rjw, linux-kernel, linux-xfs,
	linux-fsdevel, imammedo

On Wed, 10 Apr 2019 10:03:01 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

> > 
> > On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:  
> > > This patch adds virtio-pmem driver for KVM guest.
> > > 
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > > 
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > > 
> > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>

> > > diff --git a/include/uapi/linux/virtio_ids.h
> > > b/include/uapi/linux/virtio_ids.h
> > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > --- a/include/uapi/linux/virtio_ids.h
> > > +++ b/include/uapi/linux/virtio_ids.h
> > > @@ -43,5 +43,6 @@
> > >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> > >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> > >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> > >  
> > >  #endif /* _LINUX_VIRTIO_IDS_H */  
> > 
> > Didn't Paolo point out someone is using 25 for audio?  
> 
> 
> Sorry! I did not notice this.
> 
> > 
> > Please, reserve an ID with the Virtio TC before using it.  
> 
> I thought I requested[1] ID 25.
> 
> [1] https://github.com/oasis-tcs/virtio-spec/issues/38
> [2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
> 
> In that case I will send request for next available ID i.e 26?

Have the folks using this for audio sent a reservation request as well?
If not, I'd say the first requester should get the id...

(And yes, we need to be quicker with voting on/applying id
reservations :/)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 12:24   ` Cornelia Huck
@ 2019-04-10 14:38     ` Yuval Shaia
  2019-04-10 15:44       ` Pankaj Gupta
  2019-04-10 15:38     ` Pankaj Gupta
  1 sibling, 1 reply; 38+ messages in thread
From: Yuval Shaia @ 2019-04-10 14:38 UTC (permalink / raw)
  To: Cornelia Huck, pagupta
  Cc: Pankaj Gupta, linux-nvdimm, linux-kernel, virtualization, kvm,
	linux-fsdevel, linux-acpi, qemu-devel, linux-ext4, linux-xfs,
	dan.j.williams, zwisler, vishal.l.verma, dave.jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger.kernel,
	darrick.wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, xiaoguangrong.eric, pbonzini,
	kilobyte

On Wed, Apr 10, 2019 at 02:24:26PM +0200, Cornelia Huck wrote:
> On Wed, 10 Apr 2019 09:38:22 +0530
> Pankaj Gupta <pagupta@redhat.com> wrote:
> 
> > This patch adds virtio-pmem driver for KVM guest.
> > 
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> > 
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
> >  drivers/virtio/Kconfig           |  10 +++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
> >  include/linux/virtio_pmem.h      |  60 +++++++++++++++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  10 +++
> >  7 files changed, 294 insertions(+)
> >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> >  create mode 100644 drivers/virtio/pmem.c
> >  create mode 100644 include/linux/virtio_pmem.h
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> (...)
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +	{ 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> 
> IMHO, you don't gain much by splitting off this function...

It make sense to have all the vq-init-related stuff in one function, so
here pmem_lock and req_list are used only for the vq.
Saying that - maybe it would be better to have the 3 in one struct.

> 
> > +{
> > +	struct virtqueue *vq;
> > +
> > +	/* single vq */
> > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +				host_ack, "flush_queue");
> > +	if (IS_ERR(vq))
> > +		return PTR_ERR(vq);
> 
> I'm personally not a fan of chained assignments... I think I'd just
> drop the 'vq' variable and operate on vpmem->req_vq directly.

+1

> 
> > +
> > +	spin_lock_init(&vpmem->pmem_lock);
> > +	INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > +	return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +	int err = 0;
> > +	struct resource res;
> > +	struct virtio_pmem *vpmem;
> > +	struct nvdimm_bus *nvdimm_bus;
> > +	struct nd_region_desc ndr_desc = {};
> > +	int nid = dev_to_node(&vdev->dev);
> > +	struct nd_region *nd_region;
> > +
> > +	if (!vdev->config->get) {
> > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> 
> Maybe s/config disabled/config access disabled/ ? That seems to be the
> more common message.
> 
> > +			__func__);
> > +		return -EINVAL;
> > +	}
> > +
> > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +					GFP_KERNEL);
> 
> Here, the vpmem variable makes sense for convenience, but I'm again not
> a fan of the chaining :)

+1

> 
> > +	if (!vpmem) {
> > +		err = -ENOMEM;
> > +		goto out_err;
> > +	}
> > +
> > +	vpmem->vdev = vdev;
> > +	err = init_vq(vpmem);
> > +	if (err)
> > +		goto out_err;
> > +
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			start, &vpmem->start);
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			size, &vpmem->size);
> > +
> > +	res.start = vpmem->start;
> > +	res.end   = vpmem->start + vpmem->size-1;
> > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > +	vpmem->nd_desc.module = THIS_MODULE;
> > +
> > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > +						&vpmem->nd_desc);
> 
> And here :)
> 
> > +	if (!nvdimm_bus)
> > +		goto out_vq;
> > +
> > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +
> > +	ndr_desc.res = &res;
> > +	ndr_desc.numa_node = nid;
> > +	ndr_desc.flush = virtio_pmem_flush;
> > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +	nd_region->provider_data =  dev_to_virtio
> > +					(nd_region->dev.parent->parent);
> 
> Isn't it clear that this parent chain will always end up at &vdev->dev?
> Maybe simply set ->provider_data to vdev directly? (Does it need to
> grab a reference count of the device, BTW?)
> 
> > +
> > +	if (!nd_region)
> > +		goto out_nd;
> 
> Probably better to do this check before you access nd_region's
> members :)
> 
> > +
> > +	return 0;
> > +out_nd:
> > +	err = -ENXIO;
> > +	nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > +	vdev->config->del_vqs(vdev);
> > +out_err:
> > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > +	return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > +	struct virtio_pmem *vpmem = vdev->priv;
> > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > +	nvdimm_bus_unregister(nvdimm_bus);
> 
> I haven't followed this around the nvdimm code, but is the nd_region
> you created during probe cleaned up automatically, or would you need to
> do something here?
> 
> > +	vdev->config->del_vqs(vdev);
> > +	vdev->config->reset(vdev);
> > +	kfree(vpmem);
> 
> You allocated vpmem via devm_kzalloc; isn't it freed automatically on
> remove?
> 
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > +	.driver.name		= KBUILD_MODNAME,
> > +	.driver.owner		= THIS_MODULE,
> > +	.id_table		= id_table,
> > +	.probe			= virtio_pmem_probe,
> > +	.remove			= virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> 
> Only looked at this from the general virtio driver angle; seems fine
> apart from some easy-to-fix issues and some personal style preference
> things.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10  4:08 ` [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
  2019-04-10 12:24   ` Cornelia Huck
  2019-04-10 13:12   ` Michael S. Tsirkin
@ 2019-04-10 14:41   ` Yuval Shaia
  2 siblings, 0 replies; 38+ messages in thread
From: Yuval Shaia @ 2019-04-10 14:41 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan.j.williams,
	zwisler, vishal.l.verma, dave.jiang, mst, jasowang, willy, rjw,
	hch, lenb, jack, tytso, adilger.kernel, darrick.wong,
	lcapitulino, kwolf, imammedo, jmoyer, nilal, riel, stefanha,
	aarcange, david, david, cohuck, xiaoguangrong.eric, pbonzini,
	kilobyte

> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +	int err = 0;
> +	struct resource res;
> +	struct virtio_pmem *vpmem;
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nd_region_desc ndr_desc = {};
> +	int nid = dev_to_node(&vdev->dev);
> +	struct nd_region *nd_region;
> +
> +	if (!vdev->config->get) {
> +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +					GFP_KERNEL);
> +	if (!vpmem) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	vpmem->vdev = vdev;
> +	err = init_vq(vpmem);
> +	if (err)
> +		goto out_err;
> +
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			start, &vpmem->start);
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			size, &vpmem->size);
> +
> +	res.start = vpmem->start;
> +	res.end   = vpmem->start + vpmem->size-1;
> +	vpmem->nd_desc.provider_name = "virtio-pmem";
> +	vpmem->nd_desc.module = THIS_MODULE;
> +
> +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> +						&vpmem->nd_desc);
> +	if (!nvdimm_bus)
> +		goto out_vq;
> +
> +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> +	ndr_desc.res = &res;
> +	ndr_desc.numa_node = nid;
> +	ndr_desc.flush = virtio_pmem_flush;
> +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +	nd_region->provider_data =  dev_to_virtio

Pleas delete the extra space.

> +					(nd_region->dev.parent->parent);
> +
> +	if (!nd_region)
> +		goto out_nd;
> +
> +	return 0;
> +out_nd:
> +	err = -ENXIO;
> +	nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> +	vdev->config->del_vqs(vdev);
> +out_err:
> +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> +	return err;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 12:24   ` Cornelia Huck
  2019-04-10 14:38     ` Yuval Shaia
@ 2019-04-10 15:38     ` Pankaj Gupta
  1 sibling, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10 15:38 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: linux-nvdimm, linux-kernel, virtualization, kvm, linux-fsdevel,
	linux-acpi, qemu-devel, linux-ext4, linux-xfs, dan j williams,
	zwisler, vishal l verma, dave jiang, mst, jasowang, willy, rjw,
	hch, lenb, jack, tytso, adilger kernel, darrick wong,
	lcapitulino, kwolf, imammedo, jmoyer, nilal, riel, stefanha,
	aarcange, david, david, xiaoguangrong eric, pbonzini, kilobyte,
	yuval shaia


> 
> > This patch adds virtio-pmem driver for KVM guest.
> > 
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> > 
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
> >  drivers/virtio/Kconfig           |  10 +++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
> >  include/linux/virtio_pmem.h      |  60 +++++++++++++++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  10 +++
> >  7 files changed, 294 insertions(+)
> >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> >  create mode 100644 drivers/virtio/pmem.c
> >  create mode 100644 include/linux/virtio_pmem.h
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> (...)
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +	{ 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> 
> IMHO, you don't gain much by splitting off this function...

o.k. I kept this for better code structure.

> 
> > +{
> > +	struct virtqueue *vq;
> > +
> > +	/* single vq */
> > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +				host_ack, "flush_queue");
> > +	if (IS_ERR(vq))
> > +		return PTR_ERR(vq);
> 
> I'm personally not a fan of chained assignments... I think I'd just
> drop the 'vq' variable and operate on vpmem->req_vq directly.

Sure.

> 
> > +
> > +	spin_lock_init(&vpmem->pmem_lock);
> > +	INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > +	return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +	int err = 0;
> > +	struct resource res;
> > +	struct virtio_pmem *vpmem;
> > +	struct nvdimm_bus *nvdimm_bus;
> > +	struct nd_region_desc ndr_desc = {};
> > +	int nid = dev_to_node(&vdev->dev);
> > +	struct nd_region *nd_region;
> > +
> > +	if (!vdev->config->get) {
> > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> 
> Maybe s/config disabled/config access disabled/ ? That seems to be the
> more common message.

This is better.

> 
> > +			__func__);
> > +		return -EINVAL;
> > +	}
> > +
> > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +					GFP_KERNEL);
> 
> Here, the vpmem variable makes sense for convenience, but I'm again not
> a fan of the chaining :)

Sure will change :)

> 
> > +	if (!vpmem) {
> > +		err = -ENOMEM;
> > +		goto out_err;
> > +	}
> > +
> > +	vpmem->vdev = vdev;
> > +	err = init_vq(vpmem);
> > +	if (err)
> > +		goto out_err;
> > +
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			start, &vpmem->start);
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			size, &vpmem->size);
> > +
> > +	res.start = vpmem->start;
> > +	res.end   = vpmem->start + vpmem->size-1;
> > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > +	vpmem->nd_desc.module = THIS_MODULE;
> > +
> > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > +						&vpmem->nd_desc);
> 
> And here :)

Sure.

> 
> > +	if (!nvdimm_bus)
> > +		goto out_vq;
> > +
> > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +
> > +	ndr_desc.res = &res;
> > +	ndr_desc.numa_node = nid;
> > +	ndr_desc.flush = virtio_pmem_flush;
> > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +	nd_region->provider_data =  dev_to_virtio
> > +					(nd_region->dev.parent->parent);
> 
> Isn't it clear that this parent chain will always end up at &vdev->dev?

Yes, It will resolve to &vdev->dev.

> Maybe simply set ->provider_data to vdev directly? (Does it need to
> grab a reference count of the device, BTW?)

reference is already taken when registering the device with nvdimm_bus.

> 
> > +
> > +	if (!nd_region)
> > +		goto out_nd;
> 
> Probably better to do this check before you access nd_region's
> members :)

ah Sorry! Will correct this.

> 
> > +
> > +	return 0;
> > +out_nd:
> > +	err = -ENXIO;
> > +	nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > +	vdev->config->del_vqs(vdev);
> > +out_err:
> > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > +	return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > +	struct virtio_pmem *vpmem = vdev->priv;
> > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > +	nvdimm_bus_unregister(nvdimm_bus);
> 
> I haven't followed this around the nvdimm code, but is the nd_region
> you created during probe cleaned up automatically, or would you need to
> do something here?

'nvdimm_bus_unregister' does that.
> 
> > +	vdev->config->del_vqs(vdev);
> > +	vdev->config->reset(vdev);
> > +	kfree(vpmem);
> 
> You allocated vpmem via devm_kzalloc; isn't it freed automatically on
> remove?

yes. Will remove free(vpmem).

> 
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > +	.driver.name		= KBUILD_MODNAME,
> > +	.driver.owner		= THIS_MODULE,
> > +	.id_table		= id_table,
> > +	.probe			= virtio_pmem_probe,
> > +	.remove			= virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> 
> Only looked at this from the general virtio driver angle; seems fine
> apart from some easy-to-fix issues and some personal style preference
> things.

Thank you for the review.

Best regards,
Pankaj
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 14:38     ` Yuval Shaia
@ 2019-04-10 15:44       ` Pankaj Gupta
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-10 15:44 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: Cornelia Huck, linux-nvdimm, linux-kernel, virtualization, kvm,
	linux-fsdevel, linux-acpi, qemu-devel, linux-ext4, linux-xfs,
	dan j williams, zwisler, vishal l verma, dave jiang, mst,
	jasowang, willy, rjw, hch, lenb, jack, tytso, adilger kernel,
	darrick wong, lcapitulino, kwolf, imammedo, jmoyer, nilal, riel,
	stefanha, aarcange, david, david, xiaoguangrong eric, pbonzini,
	kilobyte


> > 
> > > This patch adds virtio-pmem driver for KVM guest.
> > > 
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > > 
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > > 
> > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > ---
> > >  drivers/nvdimm/virtio_pmem.c     |  88 ++++++++++++++++++++++
> > >  drivers/virtio/Kconfig           |  10 +++
> > >  drivers/virtio/Makefile          |   1 +
> > >  drivers/virtio/pmem.c            | 124 +++++++++++++++++++++++++++++++
> > >  include/linux/virtio_pmem.h      |  60 +++++++++++++++
> > >  include/uapi/linux/virtio_ids.h  |   1 +
> > >  include/uapi/linux/virtio_pmem.h |  10 +++
> > >  7 files changed, 294 insertions(+)
> > >  create mode 100644 drivers/nvdimm/virtio_pmem.c
> > >  create mode 100644 drivers/virtio/pmem.c
> > >  create mode 100644 include/linux/virtio_pmem.h
> > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > 
> > (...)
> > > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > > new file mode 100644
> > > index 000000000000..cc9de9589d56
> > > --- /dev/null
> > > +++ b/drivers/virtio/pmem.c
> > > @@ -0,0 +1,124 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and registers the virtual pmem device
> > > + * with libnvdimm core.
> > > + */
> > > +#include <linux/virtio_pmem.h>
> > > +#include <../../drivers/nvdimm/nd.h>
> > > +
> > > +static struct virtio_device_id id_table[] = {
> > > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > +	{ 0 },
> > > +};
> > > +
> > > + /* Initialize virt queue */
> > > +static int init_vq(struct virtio_pmem *vpmem)
> > 
> > IMHO, you don't gain much by splitting off this function...
> 
> It make sense to have all the vq-init-related stuff in one function, so
> here pmem_lock and req_list are used only for the vq.

Yes.

> Saying that - maybe it would be better to have the 3 in one struct.
> 
> > 
> > > +{
> > > +	struct virtqueue *vq;
> > > +
> > > +	/* single vq */
> > > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > +				host_ack, "flush_queue");
> > > +	if (IS_ERR(vq))
> > > +		return PTR_ERR(vq);
> > 
> > I'm personally not a fan of chained assignments... I think I'd just
> > drop the 'vq' variable and operate on vpmem->req_vq directly.
> 
> +1

Will drop extra vq.

> 
> > 
> > > +
> > > +	spin_lock_init(&vpmem->pmem_lock);
> > > +	INIT_LIST_HEAD(&vpmem->req_list);
> > > +
> > > +	return 0;
> > > +};
> > > +
> > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > +{
> > > +	int err = 0;
> > > +	struct resource res;
> > > +	struct virtio_pmem *vpmem;
> > > +	struct nvdimm_bus *nvdimm_bus;
> > > +	struct nd_region_desc ndr_desc = {};
> > > +	int nid = dev_to_node(&vdev->dev);
> > > +	struct nd_region *nd_region;
> > > +
> > > +	if (!vdev->config->get) {
> > > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> > 
> > Maybe s/config disabled/config access disabled/ ? That seems to be the
> > more common message.
> > 
> > > +			__func__);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > +					GFP_KERNEL);
> > 
> > Here, the vpmem variable makes sense for convenience, but I'm again not
> > a fan of the chaining :)
> 
> +1

here as well.

> 
> > 
> > > +	if (!vpmem) {
> > > +		err = -ENOMEM;
> > > +		goto out_err;
> > > +	}
> > > +
> > > +	vpmem->vdev = vdev;
> > > +	err = init_vq(vpmem);
> > > +	if (err)
> > > +		goto out_err;
> > > +
> > > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > +			start, &vpmem->start);
> > > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > +			size, &vpmem->size);
> > > +
> > > +	res.start = vpmem->start;
> > > +	res.end   = vpmem->start + vpmem->size-1;
> > > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > > +	vpmem->nd_desc.module = THIS_MODULE;
> > > +
> > > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > +						&vpmem->nd_desc);
> > 
> > And here :)
> > 
> > > +	if (!nvdimm_bus)
> > > +		goto out_vq;
> > > +
> > > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > +
> > > +	ndr_desc.res = &res;
> > > +	ndr_desc.numa_node = nid;
> > > +	ndr_desc.flush = virtio_pmem_flush;
> > > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > +	set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > +	nd_region->provider_data =  dev_to_virtio
> > > +					(nd_region->dev.parent->parent);
> > 
> > Isn't it clear that this parent chain will always end up at &vdev->dev?
> > Maybe simply set ->provider_data to vdev directly? (Does it need to
> > grab a reference count of the device, BTW?)
> > 
> > > +
> > > +	if (!nd_region)
> > > +		goto out_nd;
> > 
> > Probably better to do this check before you access nd_region's
> > members :)
> > 
> > > +
> > > +	return 0;
> > > +out_nd:
> > > +	err = -ENXIO;
> > > +	nvdimm_bus_unregister(nvdimm_bus);
> > > +out_vq:
> > > +	vdev->config->del_vqs(vdev);
> > > +out_err:
> > > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > +	return err;
> > > +}
> > > +
> > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > +{
> > > +	struct virtio_pmem *vpmem = vdev->priv;
> > > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > +
> > > +	nvdimm_bus_unregister(nvdimm_bus);
> > 
> > I haven't followed this around the nvdimm code, but is the nd_region
> > you created during probe cleaned up automatically, or would you need to
> > do something here?
> > 
> > > +	vdev->config->del_vqs(vdev);
> > > +	vdev->config->reset(vdev);
> > > +	kfree(vpmem);
> > 
> > You allocated vpmem via devm_kzalloc; isn't it freed automatically on
> > remove?
> > 
> > > +}
> > > +
> > > +static struct virtio_driver virtio_pmem_driver = {
> > > +	.driver.name		= KBUILD_MODNAME,
> > > +	.driver.owner		= THIS_MODULE,
> > > +	.id_table		= id_table,
> > > +	.probe			= virtio_pmem_probe,
> > > +	.remove			= virtio_pmem_remove,
> > > +};
> > > +
> > > +module_virtio_driver(virtio_pmem_driver);
> > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > +MODULE_LICENSE("GPL");
> > 
> > Only looked at this from the general virtio driver angle; seems fine
> > apart from some easy-to-fix issues and some personal style preference
> > things.
> 

Best regards,
Pankaj

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 14:31       ` Cornelia Huck
@ 2019-04-10 16:46         ` Michael S. Tsirkin
  2019-04-10 16:52           ` Cornelia Huck
  0 siblings, 1 reply; 38+ messages in thread
From: Michael S. Tsirkin @ 2019-04-10 16:46 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Pankaj Gupta, jack, kvm, david, jasowang, david, qemu-devel,
	virtualization, adilger kernel, zwisler, aarcange, dave jiang,
	linux-nvdimm, vishal l verma, willy, hch, linux-acpi, jmoyer,
	linux-ext4, lenb, kilobyte, riel, yuval shaia, stefanha,
	pbonzini, dan j williams, lcapitulino, kwolf, nilal, tytso,
	xiaoguangrong eric, darrick wong, rjw, linux-kernel, linux-xfs,
	linux-fsdevel, imammedo

On Wed, Apr 10, 2019 at 04:31:39PM +0200, Cornelia Huck wrote:
> On Wed, 10 Apr 2019 10:03:01 -0400 (EDT)
> Pankaj Gupta <pagupta@redhat.com> wrote:
> 
> > > 
> > > On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:  
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > > 
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > > 
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > > > 
> > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> 
> > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > b/include/uapi/linux/virtio_ids.h
> > > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > > --- a/include/uapi/linux/virtio_ids.h
> > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > @@ -43,5 +43,6 @@
> > > >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> > > >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> > > >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > > > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> > > >  
> > > >  #endif /* _LINUX_VIRTIO_IDS_H */  
> > > 
> > > Didn't Paolo point out someone is using 25 for audio?  
> > 
> > 
> > Sorry! I did not notice this.
> > 
> > > 
> > > Please, reserve an ID with the Virtio TC before using it.  
> > 
> > I thought I requested[1] ID 25.
> > 
> > [1] https://github.com/oasis-tcs/virtio-spec/issues/38
> > [2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
> > 
> > In that case I will send request for next available ID i.e 26?
> 
> Have the folks using this for audio sent a reservation request as well?
> If not, I'd say the first requester should get the id...

No but I think they ship their's already. No one ships pmem
so it's less pain for everyone if we just skip 25.

> (And yes, we need to be quicker with voting on/applying id
> reservations :/)

We can't vote on what was not proposed for a vote.
At the moment that responsibility is with the commented
once discussion on the comment has taken place.

I think what's missing is a description of the process
in the README. Want to write it up and post it?

-- 
MST

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver
  2019-04-10 16:46         ` Michael S. Tsirkin
@ 2019-04-10 16:52           ` Cornelia Huck
  0 siblings, 0 replies; 38+ messages in thread
From: Cornelia Huck @ 2019-04-10 16:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Pankaj Gupta, jack, kvm, david, jasowang, david, qemu-devel,
	virtualization, adilger kernel, zwisler, aarcange, dave jiang,
	linux-nvdimm, vishal l verma, willy, hch, linux-acpi, jmoyer,
	linux-ext4, lenb, kilobyte, riel, yuval shaia, stefanha,
	pbonzini, dan j williams, lcapitulino, kwolf, nilal, tytso,
	xiaoguangrong eric, darrick wong, rjw, linux-kernel, linux-xfs,
	linux-fsdevel, imammedo

On Wed, 10 Apr 2019 12:46:12 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Apr 10, 2019 at 04:31:39PM +0200, Cornelia Huck wrote:
> > On Wed, 10 Apr 2019 10:03:01 -0400 (EDT)
> > Pankaj Gupta <pagupta@redhat.com> wrote:
> >   
> > > > 
> > > > On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:    
> > > > > This patch adds virtio-pmem driver for KVM guest.
> > > > > 
> > > > > Guest reads the persistent memory range information from
> > > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > > creates a nd_region object with the persistent memory
> > > > > range information so that existing 'nvdimm/pmem' driver
> > > > > can reserve this into system memory map. This way
> > > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > > driver to register persistent memory compatible for DAX
> > > > > capable filesystems.
> > > > > 
> > > > > This also provides function to perform guest flush over
> > > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > > on DAX memory range.
> > > > > 
> > > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>  
> >   
> > > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > > b/include/uapi/linux/virtio_ids.h
> > > > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > > > --- a/include/uapi/linux/virtio_ids.h
> > > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > > @@ -43,5 +43,6 @@
> > > > >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> > > > >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> > > > >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > > > > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> > > > >  
> > > > >  #endif /* _LINUX_VIRTIO_IDS_H */    
> > > > 
> > > > Didn't Paolo point out someone is using 25 for audio?    
> > > 
> > > 
> > > Sorry! I did not notice this.
> > >   
> > > > 
> > > > Please, reserve an ID with the Virtio TC before using it.    
> > > 
> > > I thought I requested[1] ID 25.
> > > 
> > > [1] https://github.com/oasis-tcs/virtio-spec/issues/38
> > > [2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
> > > 
> > > In that case I will send request for next available ID i.e 26?  
> > 
> > Have the folks using this for audio sent a reservation request as well?
> > If not, I'd say the first requester should get the id...  
> 
> No but I think they ship their's already. No one ships pmem
> so it's less pain for everyone if we just skip 25.

Ugh. Ok, then we should change pmem...

> 
> > (And yes, we need to be quicker with voting on/applying id
> > reservations :/)  
> 
> We can't vote on what was not proposed for a vote.
> At the moment that responsibility is with the commented
> once discussion on the comment has taken place.
> 
> I think what's missing is a description of the process
> in the README. Want to write it up and post it?

I can add that to my TODO list, can't promise speedy resolution, though.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-10  4:08 ` [PATCH v5 1/6] libnvdimm: nd_region flush callback support Pankaj Gupta
@ 2019-04-11 14:51   ` Dan Williams
  2019-04-11 15:57     ` [Qemu-devel] " Pankaj Gupta
  2019-04-12  8:32     ` Jan Kara
  0 siblings, 2 replies; 38+ messages in thread
From: Dan Williams @ 2019-04-11 14:51 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, Linux Kernel Mailing List, virtualization,
	KVM list, linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Christoph Hellwig, Len Brown, Jan Kara,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, jmoyer, Nitesh Narayan Lal,
	Rik van Riel, Stefan Hajnoczi, Andrea Arcangeli,
	David Hildenbrand, david, cohuck, Xiao Guangrong, Paolo Bonzini,
	kilobyte, yuval.shaia

On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <pagupta@redhat.com> wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> This also handles asynchronous flush requests from the block layer
> by creating a child bio and chaining it with parent bio.
>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
[..]
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index b4ef7d9ff22e..fb1041ab32a6 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
>                 return rc;
>         if (!flush)
>                 return -EINVAL;
> -       nvdimm_flush(nd_region);
> +       rc = nvdimm_flush(nd_region, NULL, false);
> +       if (rc)
> +               return rc;
>
>         return len;
>  }
> @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
>         dev->of_node = ndr_desc->of_node;
>         nd_region->ndr_size = resource_size(ndr_desc->res);
>         nd_region->ndr_start = ndr_desc->res->start;
> +       if (ndr_desc->flush)
> +               nd_region->flush = ndr_desc->flush;
> +       else
> +               nd_region->flush = generic_nvdimm_flush;
> +
>         nd_device_register(dev);
>
>         return nd_region;
> @@ -1125,11 +1132,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
>  }
>  EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
>
> +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> +{

I don't quite see the point of the 'async' argument. All the usages of
this routine are either

nvdimm_flush(nd_region, bio, true)
...or:
nvdimm_flush(nd_region, NULL, false)

...so why not gate async behavior on the presence of the 'bio' argument?


> +       int rc = 0;
> +
> +       /* Create child bio for asynchronous flush and chain with
> +        * parent bio. Otherwise directly call nd_region flush.
> +        */
> +       if (async && bio->bi_iter.bi_sector != -1) {
> +
> +               struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> +
> +               if (!child)
> +                       return -ENOMEM;
> +               bio_copy_dev(child, bio);
> +               child->bi_opf = REQ_PREFLUSH;
> +               child->bi_iter.bi_sector = -1;
> +               bio_chain(child, bio);
> +               submit_bio(child);

I understand how this works, but it's a bit too "magical" for my
taste. I would prefer that all flush implementations take an optional
'bio' argument rather than rely on the make_request implementation to
stash the bio away on a driver specific list.

> +       } else {
> +               if (nd_region->flush(nd_region))
> +                       rc = -EIO;

Given the common case wants to be fast and synchronous I think we
should try to avoid retpoline overhead by default. So something like
this:

if (nd_region->flush == generic_nvdimm_flush)
    rc = generic_nvdimm_flush(...);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 3/6] libnvdimm: add dax_dev sync flag
  2019-04-10  4:08 ` [PATCH v5 3/6] libnvdimm: add dax_dev sync flag Pankaj Gupta
  2019-04-10  8:28   ` Jan Kara
@ 2019-04-11 14:56   ` Dan Williams
  2019-04-11 15:39     ` Pankaj Gupta
  1 sibling, 1 reply; 38+ messages in thread
From: Dan Williams @ 2019-04-11 14:56 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-nvdimm, Linux Kernel Mailing List, virtualization,
	KVM list, linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Christoph Hellwig, Len Brown, Jan Kara,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, jmoyer, Nitesh Narayan Lal,
	Rik van Riel, Stefan Hajnoczi, Andrea Arcangeli,
	David Hildenbrand, david, cohuck, Xiao Guangrong, Paolo Bonzini,
	kilobyte, yuval.shaia

On Tue, Apr 9, 2019 at 9:10 PM Pankaj Gupta <pagupta@redhat.com> wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This later
> is used to disable MAP_SYNC functionality for
> ext4 & xfs filesystem for devices don't support
> synchronous flush.
>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/dax/bus.c            |  2 +-
>  drivers/dax/super.c          | 13 ++++++++++++-
>  drivers/md/dm.c              |  2 +-
>  drivers/nvdimm/pmem.c        |  3 ++-
>  drivers/nvdimm/region_devs.c |  7 +++++++
>  include/linux/dax.h          |  9 +++++++--
>  include/linux/libnvdimm.h    |  1 +
>  7 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 2109cfe80219..431bf7d2a7f9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
>          * No 'host' or dax_operations since there is no access to this
>          * device outside of mmap of the resulting character device.
>          */
> -       dax_dev = alloc_dax(dev_dax, NULL, NULL);
> +       dax_dev = alloc_dax(dev_dax, NULL, NULL, true);

I find apis that take a boolean as unreadable. What does 'true' mean?
It wastes time to go look at the function definition vs something
like:

    alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 3/6] libnvdimm: add dax_dev sync flag
  2019-04-11 14:56   ` Dan Williams
@ 2019-04-11 15:39     ` Pankaj Gupta
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-11 15:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Linux Kernel Mailing List, virtualization,
	KVM list, linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Christoph Hellwig, Len Brown, Jan Kara,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, jmoyer, Nitesh Narayan Lal,
	Rik van Riel, Stefan Hajnoczi, Andrea Arcangeli,
	David Hildenbrand, david, cohuck, Xiao Guangrong, Paolo Bonzini,
	kilobyte, yuval shaia


Hi Dan,

Thank you for the review.

> >
> > This patch adds 'DAXDEV_SYNC' flag which is set
> > for nd_region doing synchronous flush. This later
> > is used to disable MAP_SYNC functionality for
> > ext4 & xfs filesystem for devices don't support
> > synchronous flush.
> >
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/dax/bus.c            |  2 +-
> >  drivers/dax/super.c          | 13 ++++++++++++-
> >  drivers/md/dm.c              |  2 +-
> >  drivers/nvdimm/pmem.c        |  3 ++-
> >  drivers/nvdimm/region_devs.c |  7 +++++++
> >  include/linux/dax.h          |  9 +++++++--
> >  include/linux/libnvdimm.h    |  1 +
> >  7 files changed, 31 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 2109cfe80219..431bf7d2a7f9 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region
> > *dax_region, int id,
> >          * No 'host' or dax_operations since there is no access to this
> >          * device outside of mmap of the resulting character device.
> >          */
> > -       dax_dev = alloc_dax(dev_dax, NULL, NULL);
> > +       dax_dev = alloc_dax(dev_dax, NULL, NULL, true);
> 
> I find apis that take a boolean as unreadable. What does 'true' mean?
> It wastes time to go look at the function definition vs something
> like:
> 
>     alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);

Agree. Will change as suggested.

Best regards,
Pankaj
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-11 14:51   ` Dan Williams
@ 2019-04-11 15:57     ` Pankaj Gupta
  2019-04-11 16:09       ` Dan Williams
  2019-04-12  8:32     ` Jan Kara
  1 sibling, 1 reply; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-11 15:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: cohuck, Jan Kara, KVM list, Michael S. Tsirkin, Jason Wang,
	david, Qemu Developers, virtualization, Andreas Dilger,
	Ross Zwisler, Andrea Arcangeli, Dave Jiang, linux-nvdimm,
	Vishal L Verma, David Hildenbrand, Matthew Wilcox,
	Christoph Hellwig, Linux ACPI, jmoyer, linux-ext4, Len Brown,
	kilobyte, Rik van Riel, yuval shaia, Stefan Hajnoczi,
	Paolo Bonzini, lcapitulino, Kevin Wolf, Nitesh Narayan Lal,
	Theodore Ts'o, Xiao Guangrong, Darrick J. Wong,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-xfs,
	linux-fsdevel, Igor Mammedov



> >
> > This patch adds functionality to perform flush from guest
> > to host over VIRTIO. We are registering a callback based
> > on 'nd_region' type. virtio_pmem driver requires this special
> > flush function. For rest of the region types we are registering
> > existing flush function. Report error returned by host fsync
> > failure to userspace.
> >
> > This also handles asynchronous flush requests from the block layer
> > by creating a child bio and chaining it with parent bio.
> >
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---bio_chain Dan williams
> [..]
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index b4ef7d9ff22e..fb1041ab32a6 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > struct device_attribute *att
> >                 return rc;
> >         if (!flush)
> >                 return -EINVAL;
> > -       nvdimm_flush(nd_region);
> > +       rc = nvdimm_flush(nd_region, NULL, false);
> > +       if (rc)
> > +               return rc;
> >
> >         return len;
> >  }
> > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > nvdimm_bus *nvdimm_bus,
> >         dev->of_node = ndr_desc->of_node;
> >         nd_region->ndr_size = resource_size(ndr_desc->res);
> >         nd_region->ndr_start = ndr_desc->res->start;
> > +       if (ndr_desc->flush)
> > +               nd_region->flush = ndr_desc->flush;
> > +       else
> > +               nd_region->flush = generic_nvdimm_flush;
> > +
> >         nd_device_register(dev);
> >
> >         return nd_region;
> > @@ -1125,11 +1132,36 @@ struct nd_region
> > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> >  }
> >  EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> >
> > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> > +{
> 
> I don't quite see the point of the 'async' argument. All the usages of
> this routine are either
> 
> nvdimm_flush(nd_region, bio, true)
> ...or:
> nvdimm_flush(nd_region, NULL, false)

Agree.

> 
> ...so why not gate async behavior on the presence of the 'bio' argument?

Sure.

> 
> 
> > +       int rc = 0;
> > +
> > +       /* Create child bio for asynchronous flush and chain with
> > +        * parent bio. Otherwise directly call nd_region flush.
> > +        */
> > +       if (async && bio->bi_iter.bi_sector != -1) {
> > +
> > +               struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > +
> > +               if (!child)
> > +                       return -ENOMEM;
> > +               bio_copy_dev(child, bio);
> > +               child->bi_opf = REQ_PREFLUSH;
> > +               child->bi_iter.bi_sector = -1;
> > +               bio_chain(child, bio);
> > +               submit_bio(child);
> 
> I understand how this works, but it's a bit too "magical" for my
> taste. I would prefer that all flush implementations take an optional
> 'bio' argument rather than rely on the make_request implementation to
> stash the bio away on a driver specific list.

I did this to make use of "bio_chain" for chaining child bio for async flush
suggested [1]. Are you saying to remove this and just call "flush" based on 
bio argument? Or I implemented the 'bio_chain' request entirely wrong?

[1] https://lkml.org/lkml/2018/9/27/1028

> 
> > +       } else {
> > +               if (nd_region->flush(nd_region))
> > +                       rc = -EIO;
> 
> Given the common case wants to be fast and synchronous I think we
> should try to avoid retpoline overhead by default. So something like
> this:
> 
> if (nd_region->flush == generic_nvdimm_flush)
>     rc = generic_nvdimm_flush(...);

Sure.

Thanks,
Pankaj
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-11 15:57     ` [Qemu-devel] " Pankaj Gupta
@ 2019-04-11 16:09       ` Dan Williams
  2019-04-11 16:23         ` Pankaj Gupta
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2019-04-11 16:09 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: cohuck, Jan Kara, KVM list, Michael S. Tsirkin, Jason Wang,
	david, Qemu Developers, virtualization, Andreas Dilger,
	Ross Zwisler, Andrea Arcangeli, Dave Jiang, linux-nvdimm,
	Vishal L Verma, David Hildenbrand, Matthew Wilcox,
	Christoph Hellwig, Linux ACPI, jmoyer, linux-ext4, Len Brown,
	kilobyte, Rik van Riel, yuval shaia, Stefan Hajnoczi,
	Paolo Bonzini, lcapitulino, Kevin Wolf, Nitesh Narayan Lal,
	Theodore Ts'o, Xiao Guangrong, Darrick J. Wong,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-xfs,
	linux-fsdevel, Igor Mammedov

On Thu, Apr 11, 2019 at 9:02 AM Pankaj Gupta <pagupta@redhat.com> wrote:
>
>
>
> > >
> > > This patch adds functionality to perform flush from guest
> > > to host over VIRTIO. We are registering a callback based
> > > on 'nd_region' type. virtio_pmem driver requires this special
> > > flush function. For rest of the region types we are registering
> > > existing flush function. Report error returned by host fsync
> > > failure to userspace.
> > >
> > > This also handles asynchronous flush requests from the block layer
> > > by creating a child bio and chaining it with parent bio.
> > >
> > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > ---bio_chain Dan williams
> > [..]
> > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > index b4ef7d9ff22e..fb1041ab32a6 100644
> > > --- a/drivers/nvdimm/region_devs.c
> > > +++ b/drivers/nvdimm/region_devs.c
> > > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > > struct device_attribute *att
> > >                 return rc;
> > >         if (!flush)
> > >                 return -EINVAL;
> > > -       nvdimm_flush(nd_region);
> > > +       rc = nvdimm_flush(nd_region, NULL, false);
> > > +       if (rc)
> > > +               return rc;
> > >
> > >         return len;
> > >  }
> > > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > > nvdimm_bus *nvdimm_bus,
> > >         dev->of_node = ndr_desc->of_node;
> > >         nd_region->ndr_size = resource_size(ndr_desc->res);
> > >         nd_region->ndr_start = ndr_desc->res->start;
> > > +       if (ndr_desc->flush)
> > > +               nd_region->flush = ndr_desc->flush;
> > > +       else
> > > +               nd_region->flush = generic_nvdimm_flush;
> > > +
> > >         nd_device_register(dev);
> > >
> > >         return nd_region;
> > > @@ -1125,11 +1132,36 @@ struct nd_region
> > > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> > >  }
> > >  EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> > >
> > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> > > +{
> >
> > I don't quite see the point of the 'async' argument. All the usages of
> > this routine are either
> >
> > nvdimm_flush(nd_region, bio, true)
> > ...or:
> > nvdimm_flush(nd_region, NULL, false)
>
> Agree.
>
> >
> > ...so why not gate async behavior on the presence of the 'bio' argument?
>
> Sure.
>
> >
> >
> > > +       int rc = 0;
> > > +
> > > +       /* Create child bio for asynchronous flush and chain with
> > > +        * parent bio. Otherwise directly call nd_region flush.
> > > +        */
> > > +       if (async && bio->bi_iter.bi_sector != -1) {
> > > +
> > > +               struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > +
> > > +               if (!child)
> > > +                       return -ENOMEM;
> > > +               bio_copy_dev(child, bio);
> > > +               child->bi_opf = REQ_PREFLUSH;
> > > +               child->bi_iter.bi_sector = -1;
> > > +               bio_chain(child, bio);
> > > +               submit_bio(child);
> >
> > I understand how this works, but it's a bit too "magical" for my
> > taste. I would prefer that all flush implementations take an optional
> > 'bio' argument rather than rely on the make_request implementation to
> > stash the bio away on a driver specific list.
>
> I did this to make use of "bio_chain" for chaining child bio for async flush
> suggested [1]. Are you saying to remove this and just call "flush" based on
> bio argument? Or I implemented the 'bio_chain' request entirely wrong?

No, I think you implemented it correctly. I'm just asking for the
chaining to be performed internal to the ->flush() callback rather
than in the common nvdimm_flush() front-end.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-11 16:09       ` Dan Williams
@ 2019-04-11 16:23         ` Pankaj Gupta
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-11 16:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: cohuck, Jan Kara, KVM list, Michael S. Tsirkin, Jason Wang,
	david, Qemu Developers, virtualization, Andreas Dilger,
	Ross Zwisler, Andrea Arcangeli, Dave Jiang, linux-nvdimm,
	Vishal L Verma, David Hildenbrand, Matthew Wilcox,
	Christoph Hellwig, Linux ACPI, jmoyer, linux-ext4, Len Brown,
	kilobyte, Rik van Riel, yuval shaia, Stefan Hajnoczi,
	Paolo Bonzini, lcapitulino, Kevin Wolf, Nitesh Narayan Lal,
	Theodore Ts'o, Xiao Guangrong, Darrick J. Wong,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-xfs,
	linux-fsdevel, Igor Mammedov


> > > > This patch adds functionality to perform flush from guest
> > > > to host over VIRTIO. We are registering a callback based
> > > > on 'nd_region' type. virtio_pmem driver requires this special
> > > > flush function. For rest of the region types we are registering
> > > > existing flush function. Report error returned by host fsync
> > > > failure to userspace.
> > > >
> > > > This also handles asynchronous flush requests from the block layer
> > > > by creating a child bio and chaining it with parent bio.
> > > >
> > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > > ---bio_chain Dan williams
> > > [..]
> > > > diff --git a/drivers/nvdimm/region_devs.c
> > > > b/drivers/nvdimm/region_devs.c
> > > > index b4ef7d9ff22e..fb1041ab32a6 100644
> > > > --- a/drivers/nvdimm/region_devs.c
> > > > +++ b/drivers/nvdimm/region_devs.c
> > > > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > > > struct device_attribute *att
> > > >                 return rc;
> > > >         if (!flush)
> > > >                 return -EINVAL;
> > > > -       nvdimm_flush(nd_region);
> > > > +       rc = nvdimm_flush(nd_region, NULL, false);
> > > > +       if (rc)
> > > > +               return rc;
> > > >
> > > >         return len;
> > > >  }
> > > > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > > > nvdimm_bus *nvdimm_bus,
> > > >         dev->of_node = ndr_desc->of_node;
> > > >         nd_region->ndr_size = resource_size(ndr_desc->res);
> > > >         nd_region->ndr_start = ndr_desc->res->start;
> > > > +       if (ndr_desc->flush)
> > > > +               nd_region->flush = ndr_desc->flush;
> > > > +       else
> > > > +               nd_region->flush = generic_nvdimm_flush;
> > > > +
> > > >         nd_device_register(dev);
> > > >
> > > >         return nd_region;
> > > > @@ -1125,11 +1132,36 @@ struct nd_region
> > > > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> > > >
> > > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool
> > > > async)
> > > > +{
> > >
> > > I don't quite see the point of the 'async' argument. All the usages of
> > > this routine are either
> > >
> > > nvdimm_flush(nd_region, bio, true)
> > > ...or:
> > > nvdimm_flush(nd_region, NULL, false)
> >
> > Agree.
> >
> > >
> > > ...so why not gate async behavior on the presence of the 'bio' argument?
> >
> > Sure.
> >
> > >
> > >
> > > > +       int rc = 0;
> > > > +
> > > > +       /* Create child bio for asynchronous flush and chain with
> > > > +        * parent bio. Otherwise directly call nd_region flush.
> > > > +        */
> > > > +       if (async && bio->bi_iter.bi_sector != -1) {
> > > > +
> > > > +               struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > > +
> > > > +               if (!child)
> > > > +                       return -ENOMEM;
> > > > +               bio_copy_dev(child, bio);
> > > > +               child->bi_opf = REQ_PREFLUSH;
> > > > +               child->bi_iter.bi_sector = -1;
> > > > +               bio_chain(child, bio);
> > > > +               submit_bio(child);
> > >
> > > I understand how this works, but it's a bit too "magical" for my
> > > taste. I would prefer that all flush implementations take an optional
> > > 'bio' argument rather than rely on the make_request implementation to
> > > stash the bio away on a driver specific list.
> >
> > I did this to make use of "bio_chain" for chaining child bio for async
> > flush
> > suggested [1]. Are you saying to remove this and just call "flush" based on
> > bio argument? Or I implemented the 'bio_chain' request entirely wrong?
> 
> No, I think you implemented it correctly. I'm just asking for the
> chaining to be performed internal to the ->flush() callback rather
> than in the common nvdimm_flush() front-end.

Sure. Perfect!

Thank you very much for all the suggestions. 

Best regards,
Pankaj
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-11 14:51   ` Dan Williams
  2019-04-11 15:57     ` [Qemu-devel] " Pankaj Gupta
@ 2019-04-12  8:32     ` Jan Kara
  2019-04-12 13:12       ` Jeff Moyer
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2019-04-12  8:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Pankaj Gupta, linux-nvdimm, Linux Kernel Mailing List,
	virtualization, KVM list, linux-fsdevel, Linux ACPI,
	Qemu Developers, linux-ext4, linux-xfs, Ross Zwisler,
	Vishal L Verma, Dave Jiang, Michael S. Tsirkin, Jason Wang,
	Matthew Wilcox, Rafael J. Wysocki, Christoph Hellwig, Len Brown,
	Jan Kara, Theodore Ts'o, Andreas Dilger, Darrick J. Wong,
	lcapitulino, Kevin Wolf, Igor Mammedov, jmoyer,
	Nitesh Narayan Lal, Rik van Riel, Stefan Hajnoczi,
	Andrea Arcangeli, David Hildenbrand, david, cohuck,
	Xiao Guangrong, Paolo Bonzini, kilobyte, yuval.shaia

On Thu 11-04-19 07:51:48, Dan Williams wrote:
> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <pagupta@redhat.com> wrote:
> > +       } else {
> > +               if (nd_region->flush(nd_region))
> > +                       rc = -EIO;
> 
> Given the common case wants to be fast and synchronous I think we
> should try to avoid retpoline overhead by default. So something like
> this:
> 
> if (nd_region->flush == generic_nvdimm_flush)
>     rc = generic_nvdimm_flush(...);

I'd either add a comment about avoiding retpoline overhead here or just
make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
get confused by the code.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-12  8:32     ` Jan Kara
@ 2019-04-12 13:12       ` Jeff Moyer
  2019-04-18  6:27         ` [Qemu-devel] " Pankaj Gupta
  2019-04-18 16:05         ` Dan Williams
  0 siblings, 2 replies; 38+ messages in thread
From: Jeff Moyer @ 2019-04-12 13:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Pankaj Gupta, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Christoph Hellwig, Len Brown,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, Nitesh Narayan Lal, Rik van Riel,
	Stefan Hajnoczi, Andrea Arcangeli, David Hildenbrand, david,
	cohuck, Xiao Guangrong, Paolo Bonzini, kilobyte, yuval.shaia

Jan Kara <jack@suse.cz> writes:

> On Thu 11-04-19 07:51:48, Dan Williams wrote:
>> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <pagupta@redhat.com> wrote:
>> > +       } else {
>> > +               if (nd_region->flush(nd_region))
>> > +                       rc = -EIO;
>> 
>> Given the common case wants to be fast and synchronous I think we
>> should try to avoid retpoline overhead by default. So something like
>> this:
>> 
>> if (nd_region->flush == generic_nvdimm_flush)
>>     rc = generic_nvdimm_flush(...);
>
> I'd either add a comment about avoiding retpoline overhead here or just
> make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> get confused by the code.

Isn't this premature optimization?  I really don't like adding things
like this without some numbers to show it's worth it.

-Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-12 13:12       ` Jeff Moyer
@ 2019-04-18  6:27         ` Pankaj Gupta
  2019-04-18 16:05         ` Dan Williams
  1 sibling, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-18  6:27 UTC (permalink / raw)
  To: Jeff Moyer, Jan Kara, Dan Williams
  Cc: cohuck, KVM list, Michael S. Tsirkin, Jason Wang, david,
	Qemu Developers, virtualization, Andreas Dilger, Ross Zwisler,
	Andrea Arcangeli, Dave Jiang, linux-nvdimm, Vishal L Verma,
	David Hildenbrand, Matthew Wilcox, Christoph Hellwig, Linux ACPI,
	linux-ext4, Len Brown, kilobyte, Rik van Riel, yuval shaia,
	Stefan Hajnoczi, Paolo Bonzini, lcapitulino, Kevin Wolf,
	Nitesh Narayan Lal, Theodore Ts'o, Xiao Guangrong,
	Darrick J. Wong, Rafael J. Wysocki, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Igor Mammedov


Hello,

Thank you for the suggestions on this.

> 
> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <pagupta@redhat.com> wrote:
> >> > +       } else {
> >> > +               if (nd_region->flush(nd_region))
> >> > +                       rc = -EIO;
> >> 
> >> Given the common case wants to be fast and synchronous I think we
> >> should try to avoid retpoline overhead by default. So something like
> >> this:
> >> 
> >> if (nd_region->flush == generic_nvdimm_flush)
> >>     rc = generic_nvdimm_flush(...);
> >
> > I'd either add a comment about avoiding retpoline overhead here or just
> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > get confused by the code.
> 
> Isn't this premature optimization?  I really don't like adding things
> like this without some numbers to show it's worth it.

Can we add the optimization after this version is merged.

Best regards,
Pankaj 

> 
> -Jeff
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-12 13:12       ` Jeff Moyer
  2019-04-18  6:27         ` [Qemu-devel] " Pankaj Gupta
@ 2019-04-18 16:05         ` Dan Williams
  2019-04-18 16:10           ` Jeff Moyer
  2019-04-18 16:18           ` Christoph Hellwig
  1 sibling, 2 replies; 38+ messages in thread
From: Dan Williams @ 2019-04-18 16:05 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Pankaj Gupta, linux-nvdimm, Linux Kernel Mailing List,
	virtualization, KVM list, linux-fsdevel, Linux ACPI,
	Qemu Developers, linux-ext4, linux-xfs, Ross Zwisler,
	Vishal L Verma, Dave Jiang, Michael S. Tsirkin, Jason Wang,
	Matthew Wilcox, Rafael J. Wysocki, Christoph Hellwig, Len Brown,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, Nitesh Narayan Lal, Rik van Riel,
	Stefan Hajnoczi, Andrea Arcangeli, David Hildenbrand, david,
	cohuck, Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Jan Kara <jack@suse.cz> writes:
>
> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <pagupta@redhat.com> wrote:
> >> > +       } else {
> >> > +               if (nd_region->flush(nd_region))
> >> > +                       rc = -EIO;
> >>
> >> Given the common case wants to be fast and synchronous I think we
> >> should try to avoid retpoline overhead by default. So something like
> >> this:
> >>
> >> if (nd_region->flush == generic_nvdimm_flush)
> >>     rc = generic_nvdimm_flush(...);
> >
> > I'd either add a comment about avoiding retpoline overhead here or just
> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > get confused by the code.
>
> Isn't this premature optimization?  I really don't like adding things
> like this without some numbers to show it's worth it.

I don't think it's premature given this optimization technique is
already being deployed elsewhere, see:

https://lwn.net/Articles/774347/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-18 16:05         ` Dan Williams
@ 2019-04-18 16:10           ` Jeff Moyer
  2019-04-18 16:18           ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Jeff Moyer @ 2019-04-18 16:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Pankaj Gupta, linux-nvdimm, Linux Kernel Mailing List,
	virtualization, KVM list, linux-fsdevel, Linux ACPI,
	Qemu Developers, linux-ext4, linux-xfs, Ross Zwisler,
	Vishal L Verma, Dave Jiang, Michael S. Tsirkin, Jason Wang,
	Matthew Wilcox, Rafael J. Wysocki, Christoph Hellwig, Len Brown,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, Nitesh Narayan Lal, Rik van Riel,
	Stefan Hajnoczi, Andrea Arcangeli, David Hildenbrand, david,
	cohuck, Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

Dan Williams <dan.j.williams@intel.com> writes:

> On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>> Jan Kara <jack@suse.cz> writes:
>>
>> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
>> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <pagupta@redhat.com> wrote:
>> >> > +       } else {
>> >> > +               if (nd_region->flush(nd_region))
>> >> > +                       rc = -EIO;
>> >>
>> >> Given the common case wants to be fast and synchronous I think we
>> >> should try to avoid retpoline overhead by default. So something like
>> >> this:
>> >>
>> >> if (nd_region->flush == generic_nvdimm_flush)
>> >>     rc = generic_nvdimm_flush(...);
>> >
>> > I'd either add a comment about avoiding retpoline overhead here or just
>> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> > get confused by the code.
>>
>> Isn't this premature optimization?  I really don't like adding things
>> like this without some numbers to show it's worth it.
>
> I don't think it's premature given this optimization technique is
> already being deployed elsewhere, see:
>
> https://lwn.net/Articles/774347/

The technique is fine, but that doesn't mean it should be applied
everywhere.  Is *this* code path really going to benefit from the
optimization?

-Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-18 16:05         ` Dan Williams
  2019-04-18 16:10           ` Jeff Moyer
@ 2019-04-18 16:18           ` Christoph Hellwig
  2019-04-18 18:14             ` Dan Williams
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-04-18 16:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Jan Kara, Pankaj Gupta, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Christoph Hellwig, Len Brown,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, lcapitulino,
	Kevin Wolf, Igor Mammedov, Nitesh Narayan Lal, Rik van Riel,
	Stefan Hajnoczi, Andrea Arcangeli, David Hildenbrand, david,
	cohuck, Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> > > I'd either add a comment about avoiding retpoline overhead here or just
> > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > > get confused by the code.
> >
> > Isn't this premature optimization?  I really don't like adding things
> > like this without some numbers to show it's worth it.
> 
> I don't think it's premature given this optimization technique is
> already being deployed elsewhere, see:
> 
> https://lwn.net/Articles/774347/

For one this one was backed by numbers, and second after feedback
from Linux we switched to the NULL pointer check instead.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-18 16:18           ` Christoph Hellwig
@ 2019-04-18 18:14             ` Dan Williams
  2019-04-22 15:51               ` Jeff Moyer
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2019-04-18 18:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Jan Kara, Pankaj Gupta, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Len Brown, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, lcapitulino, Kevin Wolf, Igor Mammedov,
	Nitesh Narayan Lal, Rik van Riel, Stefan Hajnoczi,
	Andrea Arcangeli, David Hildenbrand, david, cohuck,
	Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> > > > I'd either add a comment about avoiding retpoline overhead here or just
> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > > > get confused by the code.
> > >
> > > Isn't this premature optimization?  I really don't like adding things
> > > like this without some numbers to show it's worth it.
> >
> > I don't think it's premature given this optimization technique is
> > already being deployed elsewhere, see:
> >
> > https://lwn.net/Articles/774347/
>
> For one this one was backed by numbers, and second after feedback
> from Linux we switched to the NULL pointer check instead.

Ok I should have noticed the switch to NULL pointer check. However,
the question still stands do we want everyone to run numbers to
justify this optimization, or make it a new common kernel coding
practice to do:

    if (!object->op)
        generic_op(object);
    else
        object->op(object);

...in hot paths? I agree with not doing premature optimization in
principle, but this hack is minimally intrusive from a readability
perspective similar to likely()/unlikely() usage which also don't come
with numbers on a per patch basis.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-18 18:14             ` Dan Williams
@ 2019-04-22 15:51               ` Jeff Moyer
  2019-04-22 19:44                 ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2019-04-22 15:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jan Kara, Pankaj Gupta, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Len Brown, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, lcapitulino, Kevin Wolf, Igor Mammedov,
	Nitesh Narayan Lal, Rik van Riel, Stefan Hajnoczi,
	Andrea Arcangeli, David Hildenbrand, david, cohuck,
	Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

Dan Williams <dan.j.williams@intel.com> writes:

> On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
>> > > > I'd either add a comment about avoiding retpoline overhead here or just
>> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> > > > get confused by the code.
>> > >
>> > > Isn't this premature optimization?  I really don't like adding things
>> > > like this without some numbers to show it's worth it.
>> >
>> > I don't think it's premature given this optimization technique is
>> > already being deployed elsewhere, see:
>> >
>> > https://lwn.net/Articles/774347/
>>
>> For one this one was backed by numbers, and second after feedback
>> from Linux we switched to the NULL pointer check instead.
>
> Ok I should have noticed the switch to NULL pointer check. However,
> the question still stands do we want everyone to run numbers to
> justify this optimization, or make it a new common kernel coding
> practice to do:
>
>     if (!object->op)
>         generic_op(object);
>     else
>         object->op(object);
>
> ...in hot paths?

I don't think nvdimm_flush is a hot path.  Numbers of some
representative workload would prove one of us right.

-Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-22 15:51               ` Jeff Moyer
@ 2019-04-22 19:44                 ` Dan Williams
  2019-04-22 21:03                   ` Jeff Moyer
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2019-04-22 19:44 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Jan Kara, Pankaj Gupta, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Len Brown, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, lcapitulino, Kevin Wolf, Igor Mammedov,
	Nitesh Narayan Lal, Rik van Riel, Stefan Hajnoczi,
	Andrea Arcangeli, David Hildenbrand, david, cohuck,
	Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> >> > > > I'd either add a comment about avoiding retpoline overhead here or just
> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> >> > > > get confused by the code.
> >> > >
> >> > > Isn't this premature optimization?  I really don't like adding things
> >> > > like this without some numbers to show it's worth it.
> >> >
> >> > I don't think it's premature given this optimization technique is
> >> > already being deployed elsewhere, see:
> >> >
> >> > https://lwn.net/Articles/774347/
> >>
> >> For one this one was backed by numbers, and second after feedback
> >> from Linux we switched to the NULL pointer check instead.
> >
> > Ok I should have noticed the switch to NULL pointer check. However,
> > the question still stands do we want everyone to run numbers to
> > justify this optimization, or make it a new common kernel coding
> > practice to do:
> >
> >     if (!object->op)
> >         generic_op(object);
> >     else
> >         object->op(object);
> >
> > ...in hot paths?
>
> I don't think nvdimm_flush is a hot path.  Numbers of some
> representative workload would prove one of us right.

I'd rather say that the if "if (!op) do_generic()" pattern is more
readable in the general case, saves grepping for who set the op in the
common case. The fact that it has the potential to be faster is gravy
at that point.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-22 19:44                 ` Dan Williams
@ 2019-04-22 21:03                   ` Jeff Moyer
  2019-04-23  4:07                     ` Pankaj Gupta
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Moyer @ 2019-04-22 21:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jan Kara, Pankaj Gupta, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Len Brown, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, lcapitulino, Kevin Wolf, Igor Mammedov,
	Nitesh Narayan Lal, Rik van Riel, Stefan Hajnoczi,
	Andrea Arcangeli, David Hildenbrand, david, cohuck,
	Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia

Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <hch@infradead.org> wrote:
>> >>
>> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
>> >> > > > I'd either add a comment about avoiding retpoline overhead here or just
>> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> >> > > > get confused by the code.
>> >> > >
>> >> > > Isn't this premature optimization?  I really don't like adding things
>> >> > > like this without some numbers to show it's worth it.
>> >> >
>> >> > I don't think it's premature given this optimization technique is
>> >> > already being deployed elsewhere, see:
>> >> >
>> >> > https://lwn.net/Articles/774347/
>> >>
>> >> For one this one was backed by numbers, and second after feedback
>> >> from Linux we switched to the NULL pointer check instead.
>> >
>> > Ok I should have noticed the switch to NULL pointer check. However,
>> > the question still stands do we want everyone to run numbers to
>> > justify this optimization, or make it a new common kernel coding
>> > practice to do:
>> >
>> >     if (!object->op)
>> >         generic_op(object);
>> >     else
>> >         object->op(object);
>> >
>> > ...in hot paths?
>>
>> I don't think nvdimm_flush is a hot path.  Numbers of some
>> representative workload would prove one of us right.
>
> I'd rather say that the if "if (!op) do_generic()" pattern is more
> readable in the general case, saves grepping for who set the op in the
> common case. The fact that it has the potential to be faster is gravy
> at that point.

If the primary motivation is performance, then I'd expect performance
numbers to back it up.  If that isn't the primary motivation, then
choose whichever way you feel is appropriate.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support
  2019-04-22 21:03                   ` Jeff Moyer
@ 2019-04-23  4:07                     ` Pankaj Gupta
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Gupta @ 2019-04-23  4:07 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Christoph Hellwig, Jan Kara, linux-nvdimm,
	Linux Kernel Mailing List, virtualization, KVM list,
	linux-fsdevel, Linux ACPI, Qemu Developers, linux-ext4,
	linux-xfs, Ross Zwisler, Vishal L Verma, Dave Jiang,
	Michael S. Tsirkin, Jason Wang, Matthew Wilcox,
	Rafael J. Wysocki, Len Brown, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, lcapitulino, Kevin Wolf, Igor Mammedov,
	Nitesh Narayan Lal, Rik van Riel, Stefan Hajnoczi,
	Andrea Arcangeli, David Hildenbrand, david, cohuck,
	Xiao Guangrong, Paolo Bonzini, kilobyte, yuval shaia


> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer <jmoyer@redhat.com> wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <hch@infradead.org>
> >> > wrote:
> >> >>
> >> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> >> >> > > > I'd either add a comment about avoiding retpoline overhead here
> >> >> > > > or just
> >> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that
> >> >> > > > people don't
> >> >> > > > get confused by the code.
> >> >> > >
> >> >> > > Isn't this premature optimization?  I really don't like adding
> >> >> > > things
> >> >> > > like this without some numbers to show it's worth it.
> >> >> >
> >> >> > I don't think it's premature given this optimization technique is
> >> >> > already being deployed elsewhere, see:
> >> >> >
> >> >> > https://lwn.net/Articles/774347/
> >> >>
> >> >> For one this one was backed by numbers, and second after feedback
> >> >> from Linux we switched to the NULL pointer check instead.
> >> >
> >> > Ok I should have noticed the switch to NULL pointer check. However,
> >> > the question still stands do we want everyone to run numbers to
> >> > justify this optimization, or make it a new common kernel coding
> >> > practice to do:
> >> >
> >> >     if (!object->op)
> >> >         generic_op(object);
> >> >     else
> >> >         object->op(object);
> >> >
> >> > ...in hot paths?
> >>
> >> I don't think nvdimm_flush is a hot path.  Numbers of some
> >> representative workload would prove one of us right.
> >
> > I'd rather say that the if "if (!op) do_generic()" pattern is more
> > readable in the general case, saves grepping for who set the op in the
> > common case. The fact that it has the potential to be faster is gravy
> > at that point.
> 
> If the primary motivation is performance, then I'd expect performance
> numbers to back it up.  If that isn't the primary motivation, then
> choose whichever way you feel is appropriate.

Agree. This change enhances the code readability. Will add this change in
v6 with other changes.

Thank you! 

Pankaj

> 
> Cheers,
> Jeff
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2019-04-23  4:07 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10  4:08 [PATCH v5 0/6] virtio pmem driver Pankaj Gupta
2019-04-10  4:08 ` [PATCH v5 1/6] libnvdimm: nd_region flush callback support Pankaj Gupta
2019-04-11 14:51   ` Dan Williams
2019-04-11 15:57     ` [Qemu-devel] " Pankaj Gupta
2019-04-11 16:09       ` Dan Williams
2019-04-11 16:23         ` Pankaj Gupta
2019-04-12  8:32     ` Jan Kara
2019-04-12 13:12       ` Jeff Moyer
2019-04-18  6:27         ` [Qemu-devel] " Pankaj Gupta
2019-04-18 16:05         ` Dan Williams
2019-04-18 16:10           ` Jeff Moyer
2019-04-18 16:18           ` Christoph Hellwig
2019-04-18 18:14             ` Dan Williams
2019-04-22 15:51               ` Jeff Moyer
2019-04-22 19:44                 ` Dan Williams
2019-04-22 21:03                   ` Jeff Moyer
2019-04-23  4:07                     ` Pankaj Gupta
2019-04-10  4:08 ` [PATCH v5 2/5] virtio-pmem: Add virtio pmem driver Pankaj Gupta
2019-04-10 12:24   ` Cornelia Huck
2019-04-10 14:38     ` Yuval Shaia
2019-04-10 15:44       ` Pankaj Gupta
2019-04-10 15:38     ` Pankaj Gupta
2019-04-10 13:12   ` Michael S. Tsirkin
2019-04-10 14:03     ` [Qemu-devel] " Pankaj Gupta
2019-04-10 14:31       ` Cornelia Huck
2019-04-10 16:46         ` Michael S. Tsirkin
2019-04-10 16:52           ` Cornelia Huck
2019-04-10 14:41   ` Yuval Shaia
2019-04-10  4:08 ` [PATCH v5 3/6] libnvdimm: add dax_dev sync flag Pankaj Gupta
2019-04-10  8:28   ` Jan Kara
2019-04-10  8:38     ` Pankaj Gupta
2019-04-11 14:56   ` Dan Williams
2019-04-11 15:39     ` Pankaj Gupta
2019-04-10  4:08 ` [PATCH v5 4/6] dax: check synchronous mapping is supported Pankaj Gupta
2019-04-10  8:25   ` Jan Kara
2019-04-10  8:31     ` Pankaj Gupta
2019-04-10  4:08 ` [PATCH v5 5/6] ext4: disable map_sync for async flush Pankaj Gupta
2019-04-10  4:08 ` [PATCH v5 6/6] xfs: " Pankaj Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).