linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support
@ 2019-08-21 17:57 Vivek Goyal
  2019-08-21 17:57 ` [PATCH 01/19] dax: remove block device dependencies Vivek Goyal
                   ` (18 more replies)
  0 siblings, 19 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

Hi,

This patch series enables DAX support for the virtio-fs filesystem. The
patches are based on the 5.3-rc5 kernel and depend on the earlier
virtio-fs series posted with the subject "virtio-fs: shared file system
for virtual machines".

https://www.redhat.com/archives/virtio-fs/2019-August/msg00281.html

Enabling DAX appears to improve performance considerably for most
operations. I reported performance numbers with the first patch series,
so I am not repeating them here.

Any comments or feedback are welcome.

Thanks
Vivek

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (4):
  dax: remove block device dependencies
  fuse, dax: add fuse_conn->dax_dev field
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse, dax: add DAX mmap support

Vivek Goyal (12):
  dax: Pass dax_dev to dax_writeback_mapping_range()
  fuse: Keep a list of free dax memory ranges
  fuse: implement FUSE_INIT map_alignment field
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse: Define dax address space operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse: Maintain a list of busy elements
  dax: Create a range version of dax_layout_busy_page()
  fuse: Add logic to free up a memory range
  fuse: Release file in process context
  fuse: Take inode lock for dax inode truncation

 drivers/dax/super.c                |    3 +-
 drivers/virtio/virtio_mmio.c       |   32 +
 drivers/virtio/virtio_pci_modern.c |  108 +++
 fs/dax.c                           |   89 +-
 fs/ext2/inode.c                    |    2 +-
 fs/ext4/inode.c                    |    2 +-
 fs/fuse/cuse.c                     |    3 +-
 fs/fuse/dir.c                      |    2 +
 fs/fuse/file.c                     | 1206 +++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h                   |   99 ++-
 fs/fuse/inode.c                    |  138 +++-
 fs/fuse/virtio_fs.c                |  134 +++-
 fs/xfs/xfs_aops.c                  |    2 +-
 include/linux/dax.h                |   12 +-
 include/linux/virtio_config.h      |   17 +
 include/uapi/linux/fuse.h          |   47 +-
 include/uapi/linux/virtio_fs.h     |    3 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h    |   11 +-
 19 files changed, 1868 insertions(+), 53 deletions(-)

-- 
2.20.1



* [PATCH 01/19] dax: remove block device dependencies
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-26 11:51   ` Christoph Hellwig
  2019-08-21 17:57 ` [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Dan Williams

From: Stefan Hajnoczi <stefanha@redhat.com>

Although struct dax_device itself is not tied to a block device, some
DAX code assumes there is a block device.  Make block devices optional
by allowing bdev to be NULL in commonly used DAX APIs.

When there is no block device:
 * Skip the partition offset calculation in bdev_dax_pgoff()
 * Skip the blkdev_issue_zeroout() optimization

Note that more block device assumptions remain, but I haven't reached
those code paths yet.
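
The offset arithmetic with an optional block device can be sketched in
userspace as follows. This is only an illustration, not the kernel code:
`struct fake_bdev` and `fake_dax_pgoff()` are hypothetical stand-ins for
`struct block_device` and `bdev_dax_pgoff()`, and a 4 KiB page size is
assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u
#define SKETCH_PAGE_SHIFT  12u	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct block_device: only the partition start. */
struct fake_bdev {
	uint64_t start_sect;
};

/*
 * Model of the patched calculation: with no block device the partition
 * start is treated as 0, so the sector alone determines the page offset.
 */
static uint64_t fake_dax_pgoff(const struct fake_bdev *bdev, uint64_t sector)
{
	uint64_t start_sect = bdev ? bdev->start_sect : 0;
	uint64_t phys_off = (start_sect + sector) * SKETCH_SECTOR_SIZE;

	return phys_off >> SKETCH_PAGE_SHIFT;
}
```

With a NULL device, sector 8 maps to page 1; with a partition starting at
sector 2048, the same sector maps to page 257.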

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 drivers/dax/super.c | 3 ++-
 fs/dax.c            | 7 ++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 26a654dbc69a..3cbc97f3e653 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
-	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+	phys_addr_t phys_off = (start_sect + sector) * 512;
 
 	if (pgoff)
 		*pgoff = PHYS_PFN(phys_off);
diff --git a/fs/dax.c b/fs/dax.c
index 6bf81f931de3..a11147bbaf9e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1046,7 +1046,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 static bool dax_range_is_aligned(struct block_device *bdev,
 				 unsigned int offset, unsigned int length)
 {
-	unsigned short sector_size = bdev_logical_block_size(bdev);
+	unsigned short sector_size;
+
+	if (!bdev)
+		return false;
+
+	sector_size = bdev_logical_block_size(bdev);
 
 	if (!IS_ALIGNED(offset, sector_size))
 		return false;
-- 
2.20.1



* [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
  2019-08-21 17:57 ` [PATCH 01/19] dax: remove block device dependencies Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-26 11:53   ` Christoph Hellwig
  2019-08-21 17:57 ` [PATCH 03/19] virtio: Add get_shm_region method Vivek Goyal
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Dan Williams

Right now dax_writeback_mapping_range() is passed a bdev, and the dax_dev
is looked up by that bdev's disk name.

virtio-fs does not have a bdev, so also pass a dax_dev to
dax_writeback_mapping_range(). If a bdev is passed in, the dax_dev is
looked up from it as before and the dax_dev argument is expected to be
NULL; otherwise the dax_dev passed in is used directly.
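
The device selection performed by the fs/dax.c hunk in this patch can be
modelled roughly as below. This is a userspace sketch under stated
assumptions: `struct fake_bdev`, `struct fake_dax_dev` and
`pick_dax_dev()` are hypothetical names, and the `dax` member stands in
for the result of the dax_get_by_host() lookup.

```c
#include <stddef.h>

struct fake_dax_dev {
	int id;
};

/* Hypothetical block device: 'dax' models what the host lookup would find. */
struct fake_bdev {
	struct fake_dax_dev *dax;
};

/*
 * Returns the dax device to write back to, or NULL for the -EIO case.
 * *must_put is set when the device came from the lookup, because only
 * then does the caller hold a reference that has to be dropped.
 */
static struct fake_dax_dev *pick_dax_dev(struct fake_bdev *bdev,
					 struct fake_dax_dev *dax_dev,
					 int *must_put)
{
	*must_put = 0;
	if (bdev) {
		/* Block-backed caller: ignore the argument, use the lookup. */
		dax_dev = bdev->dax;
		*must_put = dax_dev != NULL;
	}
	return dax_dev;
}
```

A block-backed filesystem would call this with a bdev and a NULL dax_dev,
while a virtio-fs style caller would pass a NULL bdev and its own dax_dev.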

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            | 16 ++++++++++------
 fs/ext2/inode.c     |  2 +-
 fs/ext4/inode.c     |  2 +-
 fs/xfs/xfs_aops.c   |  2 +-
 include/linux/dax.h |  6 ++++--
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a11147bbaf9e..60620a37030c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -936,12 +936,12 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
  * on persistent storage prior to completion of the operation.
  */
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct block_device *bdev, struct dax_device *dax_dev,
+		struct writeback_control *wbc)
 {
 	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
 	struct inode *inode = mapping->host;
 	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-	struct dax_device *dax_dev;
 	void *entry;
 	int ret = 0;
 	unsigned int scanned = 0;
@@ -952,9 +952,12 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
-	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-	if (!dax_dev)
-		return -EIO;
+	if (bdev) {
+		WARN_ON(dax_dev);
+		dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+		if (!dax_dev)
+			return -EIO;
+	}
 
 	trace_dax_writeback_range(inode, xas.xa_index, end_index);
 
@@ -976,7 +979,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
-	put_dax(dax_dev);
+	if (bdev)
+		put_dax(dax_dev);
 	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
 	return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7004ce581a32..4e3870c4e255 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -958,7 +958,7 @@ static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	return dax_writeback_mapping_range(mapping,
-			mapping->host->i_sb->s_bdev, wbc);
+			mapping->host->i_sb->s_bdev, NULL, wbc);
 }
 
 const struct address_space_operations ext2_aops = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..75b85c56c732 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2992,7 +2992,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, NULL, wbc);
 	trace_ext4_writepages_result(inode, wbc, ret,
 				     nr_to_write - wbc->nr_to_write);
 	percpu_up_read(&sbi->s_journal_flag_rwsem);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f16d5f196c6b..71a7007509c4 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1120,7 +1120,7 @@ xfs_dax_writepages(
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return dax_writeback_mapping_range(mapping,
-			xfs_find_bdev_for_inode(mapping->host), wbc);
+			xfs_find_bdev_for_inode(mapping->host), NULL, wbc);
 }
 
 STATIC int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9bd8528bd305..e7f40108f2c9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,7 +141,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
+		struct block_device *bdev, struct dax_device *dax_dev,
+		struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
@@ -180,7 +181,8 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct block_device *bdev, struct dax_device *dax_dev,
+		struct writeback_control *wbc)
 {
 	return -EOPNOTSUPP;
 }
-- 
2.20.1



* [PATCH 03/19] virtio: Add get_shm_region method
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
  2019-08-21 17:57 ` [PATCH 01/19] dax: remove block device dependencies Vivek Goyal
  2019-08-21 17:57 ` [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 04/19] virtio: Implement get_shm_region for PCI transport Vivek Goyal
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

Virtio defines 'shared memory regions' that provide a continuously
shared region between the host and guest.

Provide a method to find a particular region on a device.
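
The optional-callback pattern used by the helper in this patch can be
sketched in userspace like this. All `fake_` and `toy_` names are
hypothetical, and the toy transport's address and length are made-up
values chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct shm_region {
	uint64_t addr;
	uint64_t len;
};

struct fake_dev;

/* Hypothetical ops table: a transport may leave the callback NULL. */
struct fake_config_ops {
	bool (*get_shm_region)(struct fake_dev *dev,
			       struct shm_region *region, uint8_t id);
};

struct fake_dev {
	const struct fake_config_ops *config;
};

/* Wrapper: report "no region" when the transport has no such method. */
static bool fake_get_shm_region(struct fake_dev *dev,
				struct shm_region *region, uint8_t id)
{
	if (!dev->config->get_shm_region)
		return false;
	return dev->config->get_shm_region(dev, region, id);
}

/* Toy transport exposing a single region with id 0 (made-up values). */
static bool toy_get_shm_region(struct fake_dev *dev,
			       struct shm_region *region, uint8_t id)
{
	(void)dev;
	if (id != 0)
		return false;
	region->addr = 0x100000000ULL;
	region->len = 0x40000000ULL;
	return true;
}
```

Drivers can then call the wrapper unconditionally and treat a false
return as "this device has no such region", whatever the transport.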

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 include/linux/virtio_config.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bb4cc4910750..c859f000a751 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -10,6 +10,11 @@
 
 struct irq_affinity;
 
+struct virtio_shm_region {
+       u64 addr;
+       u64 len;
+};
+
 /**
  * virtio_config_ops - operations for configuring a virtio device
  * Note: Do not assume that a transport implements all of the operations
@@ -65,6 +70,7 @@ struct irq_affinity;
  *      the caller can then copy.
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -88,6 +94,8 @@ struct virtio_config_ops {
 			       const struct cpumask *cpu_mask);
 	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
 			int index);
+	bool (*get_shm_region)(struct virtio_device *vdev,
+			       struct virtio_shm_region *region, u8 id);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
 	return 0;
 }
 
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+                         struct virtio_shm_region *region, u8 id)
+{
+	if (!vdev->config->get_shm_region)
+		return false;
+	return vdev->config->get_shm_region(vdev, region, id);
+}
+
 static inline bool virtio_is_little_endian(struct virtio_device *vdev)
 {
 	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
-- 
2.20.1



* [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (2 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 03/19] virtio: Add get_shm_region method Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-26  1:43   ` [Virtio-fs] " piaojun
  2019-08-27  8:34   ` Cornelia Huck
  2019-08-21 17:57 ` [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf,
	kvm, kbuild test robot

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

On PCI the shm regions are found using capability entries;
find a region by searching for the capability.
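
The two pieces of arithmetic involved, joining the 32-bit halves of the
capability fields and checking the result against the BAR length, can be
sketched in userspace as below. `join_halves()` and `region_fits_bar()`
are hypothetical helper names. Note that the bounds check here is
deliberately written in an overflow-safe form, which differs from the
plain `offset + len > bar_len` comparison in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the low and high 32-bit halves read from config space. */
static uint64_t join_halves(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/*
 * True if [offset, offset + len) lies inside a BAR of bar_len bytes.
 * Written so that a huge offset or len cannot wrap around 64 bits.
 */
static bool region_fits_bar(uint64_t offset, uint64_t len, uint64_t bar_len)
{
	return offset <= bar_len && len <= bar_len - offset;
}
```

For example, an offset of 0x1000 with length 0x1000 fits a 0x2000-byte
BAR, while an offset near the 64-bit maximum is rejected instead of
wrapping.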

Cc: kvm@vger.kernel.org
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
---
 drivers/virtio/virtio_pci_modern.c | 108 +++++++++++++++++++++++++++++
 include/uapi/linux/virtio_pci.h    |  11 ++-
 2 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 7abcc50838b8..1cdedd93f42a 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -443,6 +443,112 @@ static void del_vq(struct virtio_pci_vq_info *info)
 	vring_del_virtqueue(vq);
 }
 
+static int virtio_pci_find_shm_cap(struct pci_dev *dev,
+                                   u8 required_id,
+                                   u8 *bar, u64 *offset, u64 *len)
+{
+	int pos;
+
+        for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+             pos > 0;
+             pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+		u8 type, cap_len, id;
+                u32 tmp32;
+                u64 res_offset, res_length;
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+                                                         cfg_type),
+                                     &type);
+                if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+                        continue;
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+                                                         cap_len),
+                                     &cap_len);
+		if (cap_len != sizeof(struct virtio_pci_cap64)) {
+		        printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
+                               __func__, pos, cap_len);
+                        continue;
+                }
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+                                                         id),
+                                     &id);
+                if (id != required_id)
+                        continue;
+
+                /* Type and ID match, looks good */
+                pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+                                                         bar),
+                                     bar);
+
+                /* Read the lower 32bit of length and offset */
+                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
+                                      &tmp32);
+                res_offset = tmp32;
+                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
+                                      &tmp32);
+                res_length = tmp32;
+
+                /* and now the top half */
+                pci_read_config_dword(dev,
+                                      pos + offsetof(struct virtio_pci_cap64,
+                                                     offset_hi),
+                                      &tmp32);
+                res_offset |= ((u64)tmp32) << 32;
+                pci_read_config_dword(dev,
+                                      pos + offsetof(struct virtio_pci_cap64,
+                                                     length_hi),
+                                      &tmp32);
+                res_length |= ((u64)tmp32) << 32;
+
+                *offset = res_offset;
+                *len = res_length;
+
+                return pos;
+        }
+        return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+			      struct virtio_shm_region *region, u8 id)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct pci_dev *pci_dev = vp_dev->pci_dev;
+	u8 bar;
+	u64 offset, len;
+	phys_addr_t phys_addr;
+	size_t bar_len;
+	char *bar_name;
+	int ret;
+
+	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
+		return false;
+	}
+
+	ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
+	if (ret < 0) {
+		dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
+			__func__);
+		return false;
+	}
+
+	phys_addr = pci_resource_start(pci_dev, bar);
+	bar_len = pci_resource_len(pci_dev, bar);
+
+        if (offset + len > bar_len) {
+                dev_err(&pci_dev->dev,
+                        "%s: bar shorter than cap offset+len\n",
+                        __func__);
+                return false;
+        }
+
+	region->len = len;
+	region->addr = (u64) phys_addr + offset;
+
+	return true;
+}
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.get		= NULL,
 	.set		= NULL,
@@ -457,6 +563,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.bus_name	= vp_bus_name,
 	.set_vq_affinity = vp_set_vq_affinity,
 	.get_vq_affinity = vp_get_vq_affinity,
+	.get_shm_region  = vp_get_shm_region,
 };
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -473,6 +580,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
 	.bus_name	= vp_bus_name,
 	.set_vq_affinity = vp_set_vq_affinity,
 	.get_vq_affinity = vp_get_vq_affinity,
+	.get_shm_region  = vp_get_shm_region,
 };
 
 /**
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..fe9f43680a1d 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
 #define VIRTIO_PCI_CAP_DEVICE_CFG	4
 /* PCI configuration access */
 #define VIRTIO_PCI_CAP_PCI_CFG		5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
 
 /* This is the PCI capability header: */
 struct virtio_pci_cap {
@@ -121,11 +123,18 @@ struct virtio_pci_cap {
 	__u8 cap_len;		/* Generic PCI field: capability length */
 	__u8 cfg_type;		/* Identifies the structure. */
 	__u8 bar;		/* Where to find it. */
-	__u8 padding[3];	/* Pad to full dword. */
+	__u8 id;		/* Multiple capabilities of the same type */
+	__u8 padding[2];	/* Pad to full dword. */
 	__le32 offset;		/* Offset within bar. */
 	__le32 length;		/* Length of the structure, in bytes. */
 };
 
+struct virtio_pci_cap64 {
+       struct virtio_pci_cap cap;
+       __le32 offset_hi;             /* Most sig 32 bits of offset */
+       __le32 length_hi;             /* Most sig 32 bits of length */
+};
+
 struct virtio_pci_notify_cap {
 	struct virtio_pci_cap cap;
 	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
-- 
2.20.1



* [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (3 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 04/19] virtio: Implement get_shm_region for PCI transport Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-27  8:39   ` Cornelia Huck
  2019-08-21 17:57 ` [PATCH 06/19] fuse, dax: add fuse_conn->dax_dev field Vivek Goyal
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf, kvm

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

On MMIO a new set of registers is defined for finding SHM
regions.  Add their definitions and use them to find the region.
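
The register sequence (select the region, read the length, bail out if it
is all ones, then read the base) can be modelled in userspace with a
fake register file. The register offsets are the ones added by this
patch; `struct fake_mmio`, `fake_readl()`, `fake_writel()` and
`fake_get_shm()` are hypothetical names. Unlike the diff, this sketch
writes its outputs only on success.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHM_SEL       0x0ac
#define SHM_LEN_LOW   0x0b0
#define SHM_LEN_HIGH  0x0b4
#define SHM_BASE_LOW  0x0b8
#define SHM_BASE_HIGH 0x0bc

/* Fake register file covering offsets 0x000..0x0ff. */
struct fake_mmio {
	uint32_t regs[0x100 / 4];
};

static uint32_t fake_readl(const struct fake_mmio *m, unsigned int off)
{
	return m->regs[off / 4];
}

static void fake_writel(struct fake_mmio *m, unsigned int off, uint32_t val)
{
	m->regs[off / 4] = val;
}

static bool fake_get_shm(struct fake_mmio *m, uint8_t id,
			 uint64_t *addr, uint64_t *len)
{
	uint64_t l, a;

	/* Select the region, then read its length in two halves. */
	fake_writel(m, SHM_SEL, id);
	l = fake_readl(m, SHM_LEN_LOW);
	l |= (uint64_t)fake_readl(m, SHM_LEN_HIGH) << 32;

	/* A length of all ones means the region does not exist. */
	if (l == UINT64_MAX)
		return false;

	a = fake_readl(m, SHM_BASE_LOW);
	a |= (uint64_t)fake_readl(m, SHM_BASE_HIGH) << 32;

	*addr = a;
	*len = l;
	return true;
}
```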

Cc: kvm@vger.kernel.org
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
---
 drivers/virtio/virtio_mmio.c     | 32 ++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_mmio.h | 11 +++++++++++
 2 files changed, 43 insertions(+)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index e09edb5c5e06..5c07985c8cb8 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -500,6 +500,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static bool vm_get_shm_region(struct virtio_device *vdev,
+			      struct virtio_shm_region *region, u8 id)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+	u64 len, addr;
+
+	/* Select the region we're interested in */
+	writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+	/* Read the region size */
+	len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+	len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+	region->len = len;
+
+	/* Check if region length is -1. If that's the case, the shared memory
+	 * region does not exist and there is no need to proceed further.
+	 */
+	if (len == ~(u64)0) {
+		return false;
+	}
+
+	/* Read the region base address */
+	addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+	addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+	region->addr = addr;
+
+	return true;
+}
+
 static const struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -512,6 +543,7 @@ static const struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_shm_region = vm_get_shm_region,
 };
 
 
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW	0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH	0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL             0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW         0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH        0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW        0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH       0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION	0x0fc
 
-- 
2.20.1



* [PATCH 06/19] fuse, dax: add fuse_conn->dax_dev field
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (4 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 07/19] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

From: Stefan Hajnoczi <stefanha@redhat.com>

A struct dax_device instance is a prerequisite for the DAX filesystem
APIs.  Let virtio_fs associate a dax_device with a fuse_conn.  Classic
FUSE and CUSE set the pointer to NULL, disabling DAX.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/cuse.c      | 3 ++-
 fs/fuse/fuse_i.h    | 9 ++++++++-
 fs/fuse/inode.c     | 9 ++++++---
 fs/fuse/virtio_fs.c | 5 +++--
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 04727540bdbb..0f9e3c93b056 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -504,7 +504,8 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	 * Limit the cuse channel to requests that can
 	 * be represented in file->f_cred->user_ns.
 	 */
-	fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns, NULL, &fuse_dev_fiq_ops,
+					NULL);
 
 	fud = fuse_dev_alloc_install(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 25a6da6ee8c3..ecd9dbc3312e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -77,6 +77,9 @@ struct fuse_mount_data {
 	unsigned max_read;
 	unsigned blksize;
 
+	/* DAX device, may be NULL */
+	struct dax_device *dax_dev;
+
 	/* fuse input queue operations */
 	const struct fuse_iqueue_ops *fiq_ops;
 
@@ -831,6 +834,9 @@ struct fuse_conn {
 
 	/** List of device instances belonging to this connection */
 	struct list_head devices;
+
+	/** DAX device, non-NULL if DAX is supported */
+	struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1061,7 +1067,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
  * Initialize fuse_conn
  */
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
-		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
+			struct dax_device *dax_dev,
+			const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 16bcf0f95979..6d9258a4091a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -591,7 +591,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 }
 
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
-		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
+			struct dax_device *dax_dev,
+			const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	atomic64_set(&fc->attr_version, 1);
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->dax_dev = dax_dev;
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 }
@@ -1132,8 +1134,8 @@ int fuse_fill_super_common(struct super_block *sb,
 		err = -ENOMEM;
 		if (!fc)
 			goto err;
-		fuse_conn_init(fc, sb->s_user_ns, mount_data->fiq_ops,
-			       mount_data->fiq_priv);
+		fuse_conn_init(fc, sb->s_user_ns, mount_data->dax_dev,
+			       mount_data->fiq_ops, mount_data->fiq_priv);
 		fc->release = fuse_free_conn;
 	}
 
@@ -1237,6 +1239,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 		goto err_fput;
 	__set_bit(FR_BACKGROUND, &init_req->flags);
 
+	d.dax_dev = NULL;
 	d.fiq_ops = &fuse_dev_fiq_ops;
 	d.fiq_priv = NULL;
 	d.fudptr = &file->private_data;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index ce1de9acde84..706b27e0502a 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -986,8 +986,9 @@ static struct dentry *virtio_fs_mount(struct file_system_type *fs_type,
 	fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL);
 	if (!fc)
 		return ERR_PTR(-ENOMEM);
-	fuse_conn_init(fc, get_user_ns(current_user_ns()), &virtio_fs_fiq_ops,
-		       fs);
+	d.dax_dev = NULL;
+	fuse_conn_init(fc, get_user_ns(current_user_ns()), d.dax_dev,
+		       &virtio_fs_fiq_ops, fs);
 	fc->release = fuse_free_conn;
 
 	s = sget(fs_type, virtio_fs_test_super, virtio_fs_set_super, flags, fc);
-- 
2.20.1



* [PATCH 07/19] virtio_fs, dax: Set up virtio_fs dax_device
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (5 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 06/19] fuse, dax: add fuse_conn->dax_dev field Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 08/19] fuse: Keep a list of free dax memory ranges Vivek Goyal
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf, Liu Bo

From: Stefan Hajnoczi <stefanha@redhat.com>

Set up a dax device.

Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86).  Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.
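
The window arithmetic done by virtio_fs_direct_access() in this patch
(page offset to kernel address and physical address, with the page count
clamped to the end of the window) can be sketched in userspace as below.
`struct fake_window` and `fake_direct_access()` are hypothetical names, a
4 KiB page size is assumed, and the out-of-range guard returning -1 is an
addition of this sketch that the diff does not contain.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical model of the DAX window fields kept in struct virtio_fs. */
struct fake_window {
	uint8_t *kaddr;		/* mapped address of the window */
	uint64_t phys_addr;	/* physical start of the window */
	size_t len;		/* window length in bytes */
};

/* Returns how many pages are usable from pgoff; nr_pages must be >= 0. */
static long fake_direct_access(const struct fake_window *w, uint64_t pgoff,
			       long nr_pages, void **kaddr, uint64_t *phys)
{
	uint64_t window_pages = w->len / SKETCH_PAGE_SIZE;
	uint64_t offset = pgoff * SKETCH_PAGE_SIZE;
	uint64_t max_nr_pages;

	/* Sketch-only guard: reject offsets past the end of the window. */
	if (pgoff >= window_pages)
		return -1;
	max_nr_pages = window_pages - pgoff;

	if (kaddr)
		*kaddr = w->kaddr + offset;
	if (phys)
		*phys = w->phys_addr + offset;

	/* Never report more pages than remain in the window. */
	return (uint64_t)nr_pages > max_nr_pages ? (long)max_nr_pages : nr_pages;
}
```

For a 16-page window, a request for 100 pages starting at page 4 is
clamped to the 12 pages that remain.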

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/fuse_i.h               |   1 +
 fs/fuse/inode.c                |   8 +++
 fs/fuse/virtio_fs.c            | 119 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_fs.h |   3 +
 4 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ecd9dbc3312e..7b365a29b156 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -72,6 +72,7 @@ struct fuse_mount_data {
 	unsigned group_id_present:1;
 	unsigned default_permissions:1;
 	unsigned allow_other:1;
+	unsigned dax:1;
 	unsigned destroy:1;
 	unsigned no_abort:1;
 	unsigned max_read;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 6d9258a4091a..0f58107a8269 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -436,6 +436,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_DAX,
 	OPT_ERR
 };
 
@@ -448,6 +449,7 @@ static const match_table_t tokens = {
 	{OPT_ALLOW_OTHER,		"allow_other"},
 	{OPT_MAX_READ,			"max_read=%u"},
 	{OPT_BLKSIZE,			"blksize=%u"},
+	{OPT_DAX,			"dax"},
 	{OPT_ERR,			NULL}
 };
 
@@ -534,6 +536,10 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
 			d->blksize = value;
 			break;
 
+		case OPT_DAX:
+			d->dax = 1;
+			break;
+
 		default:
 			return 0;
 		}
@@ -562,6 +568,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->dax_dev)
+		seq_printf(m, ",dax");
 	return 0;
 }
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 706b27e0502a..32604722a7fb 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,9 @@
  */
 
 #include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/pci.h>
+#include <linux/pfn_t.h>
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_fs.h>
@@ -40,6 +43,12 @@ struct virtio_fs {
 	struct virtio_fs_vq *vqs;
 	unsigned nvqs;            /* number of virtqueues */
 	unsigned num_queues;      /* number of request queues */
+	struct dax_device *dax_dev;
+
+	/* DAX memory window where file contents are mapped */
+	void *window_kaddr;
+	phys_addr_t window_phys_addr;
+	size_t window_len;
 };
 
 struct virtio_fs_forget {
@@ -433,6 +442,109 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
 	vdev->config->del_vqs(vdev);
 }
 
+/* Map a window offset to a page frame number.  The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+				    long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct virtio_fs *fs = dax_get_private(dax_dev);
+	phys_addr_t offset = PFN_PHYS(pgoff);
+	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+	if (kaddr)
+		*kaddr = fs->window_kaddr + offset;
+	if (pfn)
+		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+					PFN_DEV | PFN_MAP);
+	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+				       pgoff_t pgoff, void *addr,
+				       size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+				       pgoff_t pgoff, void *addr,
+				       size_t bytes, struct iov_iter *i)
+{
+	return copy_to_iter(addr, bytes, i);
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+	.direct_access = virtio_fs_direct_access,
+	.copy_from_iter = virtio_fs_copy_from_iter,
+	.copy_to_iter = virtio_fs_copy_to_iter,
+};
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+	struct virtio_fs *fs = data;
+
+	kill_dax(fs->dax_dev);
+	put_dax(fs->dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+	struct virtio_shm_region cache_reg;
+	struct dev_pagemap *pgmap;
+	bool have_cache;
+
+	if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+		return 0;
+
+	/* Get cache region */
+	have_cache = virtio_get_shm_region(vdev,
+					   &cache_reg,
+					   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+	if (!have_cache) {
+		dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
+		return 0;
+	} else {
+		dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
+			   cache_reg.len, cache_reg.addr);
+	}
+
+	pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+	/* Ideally we would directly use the PCI BAR resource but
+	 * devm_memremap_pages() wants its own copy in pgmap.  So
+	 * initialize a struct resource from scratch (only the start
+	 * and end fields will be used).
+	 */
+	pgmap->res = (struct resource){
+		.name = "virtio-fs dax window",
+		.start = (phys_addr_t) cache_reg.addr,
+		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
+	};
+
+	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
+	if (IS_ERR(fs->window_kaddr))
+		return PTR_ERR(fs->window_kaddr);
+
+	fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
+	fs->window_len = (phys_addr_t) cache_reg.len;
+
+	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
+		" len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
+		cache_reg.len);
+
+	fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops, 0);
+	if (!fs->dax_dev)
+		return -ENOMEM;
+
+	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
+}
+
 static int virtio_fs_probe(struct virtio_device *vdev)
 {
 	struct virtio_fs *fs;
@@ -454,6 +566,10 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 	/* TODO vq affinity */
 	/* TODO populate notifications vq */
 
+	ret = virtio_fs_setup_dax(vdev, fs);
+	if (ret < 0)
+		goto out_vqs;
+
 	/* Bring the device online in case the filesystem is mounted and
 	 * requests need to be sent before we return.
 	 */
@@ -468,7 +584,6 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 out_vqs:
 	vdev->config->reset(vdev);
 	virtio_fs_cleanup_vqs(vdev, fs);
-
 out:
 	vdev->priv = NULL;
 	return ret;
@@ -986,7 +1101,7 @@ static struct dentry *virtio_fs_mount(struct file_system_type *fs_type,
 	fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL);
 	if (!fc)
 		return ERR_PTR(-ENOMEM);
-	d.dax_dev = NULL;
+	d.dax_dev = d.dax ? fs->dax_dev : NULL;
 	fuse_conn_init(fc, get_user_ns(current_user_ns()), d.dax_dev,
 		       &virtio_fs_fiq_ops, fs);
 	fc->release = fuse_free_conn;
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index 48f3590dcfbe..d4bb549568eb 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -38,4 +38,7 @@ struct virtio_fs_config {
 	__u32 num_queues;
 } __attribute__((packed));
 
+/* For the id field in virtio_pci_shm_cap */
+#define VIRTIO_FS_SHMCAP_ID_CACHE 0
+
 #endif /* _UAPI_LINUX_VIRTIO_FS_H */
-- 
2.20.1



* [PATCH 08/19] fuse: Keep a list of free dax memory ranges
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (6 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 07/19] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 09/19] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Peng Tao

Divide the DAX memory window into fixed-size ranges (2MB for now) and put
them on a list that tracks free ranges. When an inode needs a range, we
take one from this list and insert it into the interval tree of ranges
assigned to that inode.
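
To make the carving step concrete, here is a rough user-space sketch of
the idea. It is an illustration under assumptions, not the patch code:
it uses a plain singly linked list instead of the kernel's list_head,
and struct sketch_range and the sketch_* helpers are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_RANGE_SZ (2ULL * 1024 * 1024) /* 2MB, matching the patch */

/* Hypothetical stand-in for struct fuse_dax_mapping. */
struct sketch_range {
	struct sketch_range *next;
	uint64_t window_offset; /* position in the DAX window */
	uint64_t length;        /* length in bytes */
};

/* Free every range still on the list. */
static void sketch_free_ranges(struct sketch_range **head)
{
	while (*head) {
		struct sketch_range *r = *head;

		*head = r->next;
		free(r);
	}
}

/*
 * Carve window_len bytes into 2MB ranges, appended in window order.
 * Returns the number of ranges, or -1 if an allocation fails.
 * A trailing remainder smaller than 2MB is left unused.
 */
static long sketch_init_ranges(uint64_t window_len, struct sketch_range **head)
{
	long nr_ranges = (long)(window_len / SKETCH_RANGE_SZ);
	struct sketch_range **tail = head;
	long i;

	*head = NULL;
	for (i = 0; i < nr_ranges; i++) {
		struct sketch_range *r = calloc(1, sizeof(*r));

		if (!r) {
			sketch_free_ranges(head);
			return -1;
		}
		r->window_offset = (uint64_t)i * SKETCH_RANGE_SZ;
		r->length = SKETCH_RANGE_SZ;
		*tail = r;
		tail = &r->next;
	}
	return nr_ranges;
}

/* Take the first free range, or NULL if none is left. */
static struct sketch_range *sketch_take_range(struct sketch_range **head)
{
	struct sketch_range *r = *head;

	if (r) {
		*head = r->next;
		r->next = NULL;
	}
	return r;
}
```

A 5MB window yields two ranges at window offsets 0 and 2MB; the last 1MB
is not tracked.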

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 fs/fuse/fuse_i.h    | 23 ++++++++++++
 fs/fuse/inode.c     | 86 +++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/virtio_fs.c |  2 ++
 3 files changed, 111 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7b365a29b156..f1059b51c539 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -50,6 +50,10 @@
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
+/* Default memory range size, 2MB */
+#define FUSE_DAX_MEM_RANGE_SZ	(2*1024*1024)
+#define FUSE_DAX_MEM_RANGE_PAGES	(FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -97,6 +101,18 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+	/* Will connect in fc->free_ranges to keep track of free memory */
+	struct list_head list;
+
+	/** Position in DAX window */
+	u64 window_offset;
+
+	/** Length of mapping, in bytes */
+	loff_t length;
+};
+
 /** FUSE inode */
 struct fuse_inode {
 	/** Inode data */
@@ -838,6 +854,13 @@ struct fuse_conn {
 
 	/** DAX device, non-NULL if DAX is supported */
 	struct dax_device *dax_dev;
+
+	/*
+	 * DAX Window Free Ranges. TODO: This might not be best place to store
+	 * this free list
+	 */
+	long nr_free_ranges;
+	struct list_head free_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0f58107a8269..0af147c70558 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -22,6 +22,8 @@
 #include <linux/exportfs.h>
 #include <linux/posix_acl.h>
 #include <linux/pid_namespace.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -598,6 +600,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+	struct fuse_dax_mapping *range, *temp;
+
+	/* Free All allocated elements */
+	list_for_each_entry_safe(range, temp, mem_list, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+				   struct dax_device *dax_dev)
+{
+	long nr_pages, nr_ranges;
+	void *kaddr;
+	pfn_t pfn;
+	struct fuse_dax_mapping *range;
+	LIST_HEAD(mem_ranges);
+	phys_addr_t phys_addr;
+	int ret = 0, id;
+	size_t dax_size = -1;
+	unsigned long i;
+
+	id = dax_read_lock();
+	nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), &kaddr,
+					&pfn);
+	dax_read_unlock(id);
+	if (nr_pages < 0) {
+		pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+		return nr_pages;
+	}
+
+	phys_addr = pfn_t_to_phys(pfn);
+	nr_ranges = nr_pages/FUSE_DAX_MEM_RANGE_PAGES;
+	printk("fuse_dax_mem_range_init(): dax mapped %ld pages. nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+	for (i = 0; i < nr_ranges; i++) {
+		range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+		if (!range) {
+			pr_debug("memory allocation for mem_range failed.\n");
+			ret = -ENOMEM;
+			goto out_err;
+		}
+		/* TODO: This offset only works if virtio-fs driver is not
+		 * having some memory hidden at the beginning. This needs
+		 * better handling
+		 */
+		range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
+		range->length = FUSE_DAX_MEM_RANGE_SZ;
+		list_add_tail(&range->list, &mem_ranges);
+	}
+
+	list_replace_init(&mem_ranges, &fc->free_ranges);
+	fc->nr_free_ranges = nr_ranges;
+	return 0;
+out_err:
+	/* Free All allocated elements */
+	fuse_free_dax_mem_ranges(&mem_ranges);
+	return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+					  struct dax_device *dax_dev)
+{
+	return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 			struct dax_device *dax_dev,
 			const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
@@ -628,6 +700,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->dax_dev = dax_dev;
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
+	INIT_LIST_HEAD(&fc->free_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -636,6 +709,8 @@ void fuse_conn_put(struct fuse_conn *fc)
 	if (refcount_dec_and_test(&fc->count)) {
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
+		if (fc->dax_dev)
+			fuse_free_dax_mem_ranges(&fc->free_ranges);
 		put_pid_ns(fc->pid_ns);
 		put_user_ns(fc->user_ns);
 		fc->release(fc);
@@ -1147,6 +1222,14 @@ int fuse_fill_super_common(struct super_block *sb,
 		fc->release = fuse_free_conn;
 	}
 
+	if (mount_data->dax_dev) {
+		err = fuse_dax_mem_range_init(fc, mount_data->dax_dev);
+		if (err) {
+			pr_debug("fuse_dax_mem_range_init() returned %d\n", err);
+			goto err_free_ranges;
+		}
+	}
+
 	fud = fuse_dev_alloc_install(fc);
 	if (!fud)
 		goto err_put_conn;
@@ -1208,6 +1291,9 @@ int fuse_fill_super_common(struct super_block *sb,
 	dput(root_dentry);
  err_dev_free:
 	fuse_dev_free(fud);
+ err_free_ranges:
+	if (mount_data->dax_dev)
+		fuse_free_dax_mem_ranges(&fc->free_ranges);
  err_put_conn:
 	fuse_conn_put(fc);
 	sb->s_fs_info = NULL;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 32604722a7fb..9198c2b84677 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -453,6 +453,8 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	phys_addr_t offset = PFN_PHYS(pgoff);
 	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
 
+	pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%zu\n", nr_pages, max_nr_pages);
+
 	if (kaddr)
 		*kaddr = fs->window_kaddr + offset;
 	if (pfn)
-- 
2.20.1



* [PATCH 09/19] fuse: implement FUSE_INIT map_alignment field
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (7 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 08/19] fuse: Keep a list of free dax memory ranges Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 10/19] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field.  Parse this field and
ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are
already 2MB aligned.  Just check the value when the connection is
established.  If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
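
The check described above amounts to a divisibility test. A minimal
user-space sketch, assuming the 2MB range size from this series;
sketch_map_alignment_ok() is a hypothetical name, and the zero guard is
an addition made for the sketch only:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RANGE_SZ (2U * 1024 * 1024) /* 2MB range size */

/*
 * A range size is compatible with the advertised alignment when it is a
 * whole multiple of that alignment.
 */
static bool sketch_map_alignment_ok(uint32_t map_alignment)
{
	if (map_alignment == 0)
		return false; /* sketch-only guard against division by zero */
	return (SKETCH_RANGE_SZ % map_alignment) == 0;
}
```

Alignments of 4KB and 64KB pass; 4MB does not, because 2MB is not a
multiple of it.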

The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/fuse_i.h          |  5 ++++-
 fs/fuse/inode.c           | 19 +++++++++++++++++--
 include/uapi/linux/fuse.h |  7 ++++++-
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f1059b51c539..b020a4071f80 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -50,7 +50,10 @@
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size.  A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
 #define FUSE_DAX_MEM_RANGE_SZ	(2*1024*1024)
 #define FUSE_DAX_MEM_RANGE_PAGES	(FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0af147c70558..d5d134a01117 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -949,9 +949,10 @@ static void process_init_limits(struct fuse_conn *fc, struct fuse_init_out *arg)
 static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 {
 	struct fuse_init_out *arg = &req->misc.init_out;
+	bool ok = true;
 
 	if (req->out.h.error || arg->major != FUSE_KERNEL_VERSION)
-		fc->conn_error = 1;
+		ok = false;
 	else {
 		unsigned long ra_pages;
 
@@ -1014,6 +1015,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 					min_t(unsigned int, FUSE_MAX_MAX_PAGES,
 					max_t(unsigned int, arg->max_pages, 1));
 			}
+			if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+			    (FUSE_DAX_MEM_RANGE_SZ % arg->map_alignment)) {
+				printk(KERN_ERR "FUSE: map_alignment %u incompatible with dax mem range size %u\n",
+				       arg->map_alignment,
+				       FUSE_DAX_MEM_RANGE_SZ);
+				ok = false;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1027,6 +1035,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 		fc->max_write = max_t(unsigned, 4096, fc->max_write);
 		fc->conn_init = 1;
 	}
+
+	if (!ok) {
+		fc->conn_init = 0;
+		fc->conn_error = 1;
+	}
+
 	fuse_set_initialized(fc);
 	wake_up_all(&fc->blocked_waitq);
 }
@@ -1046,7 +1060,8 @@ void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 		FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
 		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+		FUSE_MAP_ALIGNMENT;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2971d29a42e4..4461fd640cf2 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -274,6 +274,9 @@ struct fuse_file_lock {
  * FUSE_CACHE_SYMLINKS: cache READLINK responses
  * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
  * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains byte alignment for
+ *		       foffset and moffset fields in struct
+ *		       fuse_setupmapping_out and fuse_removemapping_one.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -301,6 +304,7 @@ struct fuse_file_lock {
 #define FUSE_CACHE_SYMLINKS	(1 << 23)
 #define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
 #define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
+#define FUSE_MAP_ALIGNMENT      (1 << 26)
 
 /**
  * CUSE INIT request/reply flags
@@ -653,7 +657,8 @@ struct fuse_init_out {
 	uint32_t	time_gran;
 	uint16_t	max_pages;
 	uint16_t	padding;
-	uint32_t	unused[8];
+	uint32_t	map_alignment;
+	uint32_t	unused[7];
 };
 
 #define CUSE_INIT_INFO_MAX 4096
-- 
2.20.1



* [PATCH 10/19] fuse: Introduce setupmapping/removemapping commands
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (8 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 09/19] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 11/19] fuse, dax: Implement dax read/write operations Vivek Goyal
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Peng Tao

Introduce two new FUSE commands to set up and remove memory mappings.
They will be used to set up and tear down file mappings in the DAX window.
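
As a rough illustration of the REMOVEMAPPING batching limit defined in
this patch, the sketch below works out how many entries fit in one
request and how many requests a given number of mappings would need. It
assumes a 4 KiB page; the SKETCH_* names and
sketch_removemapping_requests() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096U /* assumption: 4 KiB pages */

/* Same layout as struct fuse_removemapping_one in this patch. */
struct sketch_removemapping_one {
	uint64_t moffset; /* offset into the DAX window */
	uint64_t len;     /* length of the mapping */
};

/* Entries that fit in one page-sized payload. */
#define SKETCH_REMOVEMAPPING_MAX_ENTRY \
	(SKETCH_PAGE_SIZE / sizeof(struct sketch_removemapping_one))

/* Number of requests needed to remove num mappings, rounding up. */
static size_t sketch_removemapping_requests(size_t num)
{
	return (num + SKETCH_REMOVEMAPPING_MAX_ENTRY - 1) /
	       SKETCH_REMOVEMAPPING_MAX_ENTRY;
}
```

Under these assumptions one request carries up to 256 entries, so 257
mappings need two requests.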

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 include/uapi/linux/fuse.h | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4461fd640cf2..7c2ad3d418df 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -426,6 +426,8 @@ enum fuse_opcode {
 	FUSE_RENAME2		= 45,
 	FUSE_LSEEK		= 46,
 	FUSE_COPY_FILE_RANGE	= 47,
+	FUSE_SETUPMAPPING       = 48,
+	FUSE_REMOVEMAPPING      = 49,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
@@ -850,4 +852,41 @@ struct fuse_copy_file_range_in {
 	uint64_t	flags;
 };
 
+#define FUSE_SETUPMAPPING_ENTRIES 8
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+	/* An already open handle */
+	uint64_t	fh;
+	/* Offset into the file to start the mapping */
+	uint64_t	foffset;
+	/* Length of mapping required */
+	uint64_t	len;
+	/* Flags, FUSE_SETUPMAPPING_FLAG_* */
+	uint64_t	flags;
+	/* Offset in Memory Window */
+	uint64_t	moffset;
+};
+
+struct fuse_setupmapping_out {
+	/* Offsets into the cache of mappings */
+	uint64_t	coffset[FUSE_SETUPMAPPING_ENTRIES];
+        /* Lengths of each mapping */
+        uint64_t	len[FUSE_SETUPMAPPING_ENTRIES];
+};
+
+struct fuse_removemapping_in {
+	/* Number of fuse_removemapping_one entries that follow */
+	uint32_t        count;
+};
+
+struct fuse_removemapping_one {
+	/* Offset into the DAX window at which to start the unmapping */
+	uint64_t        moffset;
+	/* Length of the mapping to remove */
+	uint64_t	len;
+};
+
+#define FUSE_REMOVEMAPPING_MAX_ENTRY   \
+		(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
 #endif /* _LINUX_FUSE_H */
-- 
2.20.1



* [PATCH 11/19] fuse, dax: Implement dax read/write operations
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (9 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 10/19] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 19:49   ` Liu Bo
  2019-08-21 17:57 ` [PATCH 12/19] fuse, dax: add DAX mmap support Vivek Goyal
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Miklos Szeredi,
	Liu Bo, Peng Tao

This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch covers read/write
only.

We use an interval tree to keep track of per-inode DAX mappings.
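
To show what a per-inode mapping lookup has to produce, here is a small
user-space sketch of the file offset to window offset translation. It is
an illustration only: a linear array scan stands in for the interval
tree, and struct sketch_dmap, sketch_find() and sketch_translate() are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-inode DAX mapping. */
struct sketch_dmap {
	uint64_t start;         /* first file offset covered */
	uint64_t end;           /* last file offset covered (inclusive) */
	uint64_t window_offset; /* where the range sits in the DAX window */
	uint64_t length;        /* length of the range in bytes */
};

/* Linear scan in place of an interval tree query for pos. */
static const struct sketch_dmap *
sketch_find(const struct sketch_dmap *maps, size_t n, uint64_t pos)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pos >= maps[i].start && pos <= maps[i].end)
			return &maps[i];
	return NULL;
}

/*
 * Translate file position pos into a window address. Returns the number
 * of bytes mapped at *addr, limited by the request, by the end of the
 * range and by i_size. Returns 0 (a hole) if pos is at or past i_size.
 */
static uint64_t sketch_translate(const struct sketch_dmap *d, uint64_t pos,
				 uint64_t length, uint64_t i_size,
				 uint64_t *addr)
{
	uint64_t offset = pos - d->start;
	uint64_t len = d->length - offset;

	if (length < len)
		len = length;
	if (pos >= i_size)
		return 0;
	if (pos + len > i_size)
		len = i_size - pos;

	*addr = d->window_offset + offset;
	return len;
}
```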

Do not use DAX for file-extending writes; instead just send a WRITE
message to the daemon (as we do in the direct I/O path). This keeps the
write and the i_size change atomic with respect to a crash.
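
The test for a file-extending write can be sketched as below. This is a
guess at the shape of the check, not code from this patch; the helper
name and signature are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: a write extends the file when it is non-empty
 * and ends past the current i_size.
 */
static bool sketch_file_extending_write(uint64_t pos, uint64_t count,
					uint64_t i_size)
{
	return count > 0 && pos + count > i_size;
}
```

A write for which this returns true would go to the daemon as a WRITE
message; any other write could be served through the DAX window.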

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 fs/fuse/file.c            | 603 +++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          |  23 ++
 fs/fuse/inode.c           |   6 +
 include/uapi/linux/fuse.h |   1 +
 4 files changed, 627 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c45ffe6f1ecb..f323b7b04414 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,12 @@
 #include <linux/swap.h>
 #include <linux/falloc.h>
 #include <linux/uio.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/interval_tree_generic.h>
+
+INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
+                     START, LAST, static inline, fuse_dax_interval_tree);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
@@ -171,6 +177,248 @@ static void fuse_link_write_file(struct file *file)
 	spin_unlock(&fi->lock);
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+	struct fuse_dax_mapping *dmap = NULL;
+
+	spin_lock(&fc->lock);
+
+	/* TODO: Add logic to try to free up memory if wait is allowed */
+	if (fc->nr_free_ranges <= 0) {
+		spin_unlock(&fc->lock);
+		return NULL;
+	}
+
+	WARN_ON(list_empty(&fc->free_ranges));
+
+	/* Take a free range */
+	dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+					list);
+	list_del_init(&dmap->list);
+	fc->nr_free_ranges--;
+	spin_unlock(&fc->lock);
+	return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __dmap_add_to_free_pool(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	list_add_tail(&dmap->list, &fc->free_ranges);
+	fc->nr_free_ranges++;
+}
+
+static void dmap_add_to_free_pool(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	/* Return fuse_dax_mapping to free list */
+	spin_lock(&fc->lock);
+	__dmap_add_to_free_pool(fc, dmap);
+	spin_unlock(&fc->lock);
+}
+
+/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
+static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
+				  struct fuse_dax_mapping *dmap, bool writable,
+				  bool upgrade)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_setupmapping_in inarg;
+	FUSE_ARGS(args);
+	ssize_t err;
+
+	WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
+	WARN_ON(fc->nr_free_ranges < 0);
+
+	/* Ask fuse daemon to setup mapping */
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.foffset = offset;
+	inarg.fh = -1;
+	inarg.moffset = dmap->window_offset;
+	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+	inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+	if (writable)
+		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+	args.in.h.opcode = FUSE_SETUPMAPPING;
+	args.in.h.nodeid = fi->nodeid;
+	args.in.numargs = 1;
+	args.in.args[0].size = sizeof(inarg);
+	args.in.args[0].value = &inarg;
+	err = fuse_simple_request(fc, &args);
+	if (err < 0) {
+		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
+				 __func__, dmap->window_offset, err);
+		return err;
+	}
+
+	pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx writable=%d"
+		 " err=%zd\n", offset, writable, err);
+
+	dmap->writable = writable;
+	if (!upgrade) {
+		/* TODO: What locking is required here. For now,
+		 * using fc->lock
+		 */
+		dmap->start = offset;
+		dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+		/* Protected by fi->i_dmap_sem */
+		fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
+		fi->nr_dmaps++;
+	}
+	return 0;
+}
+
+static int
+fuse_send_removemapping(struct inode *inode,
+			struct fuse_removemapping_in *inargp,
+			struct fuse_removemapping_one *remove_one)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	FUSE_ARGS(args);
+
+	args.in.h.opcode = FUSE_REMOVEMAPPING;
+	args.in.h.nodeid = fi->nodeid;
+	args.in.numargs = 2;
+	args.in.args[0].size = sizeof(*inargp);
+	args.in.args[0].value = inargp;
+	args.in.args[1].size = inargp->count * sizeof(*remove_one);
+	args.in.args[1].value = remove_one;
+	return fuse_simple_request(fc, &args);
+}
+
+static int dmap_removemapping_list(struct inode *inode, unsigned num,
+				   struct list_head *to_remove)
+{
+	struct fuse_removemapping_one *remove_one, *ptr;
+	struct fuse_removemapping_in inarg;
+	struct fuse_dax_mapping *dmap;
+	int ret = 0, i = 0, nr_alloc;
+
+	nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
+	remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
+	if (!remove_one)
+		return -ENOMEM;
+
+	ptr = remove_one;
+	list_for_each_entry(dmap, to_remove, list) {
+		ptr->moffset = dmap->window_offset;
+		ptr->len = dmap->length;
+		ptr++;
+		i++;
+		num--;
+		if (i >= nr_alloc || num == 0) {
+			memset(&inarg, 0, sizeof(inarg));
+			inarg.count = i;
+			ret = fuse_send_removemapping(inode, &inarg,
+						      remove_one);
+			if (ret)
+				goto out;
+			ptr = remove_one;
+			i = 0;
+		}
+	}
+out:
+	kfree(remove_one);
+	return ret;
+}
+
+/*
+ * Cleanup dmap entry and add back to free list. This should be called with
+ * fc->lock held.
+ */
+static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
+					    struct fuse_dax_mapping *dmap)
+{
+	pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
+		 "window_offset=0x%llx length=0x%llx\n", dmap->start,
+		 dmap->end, dmap->window_offset, dmap->length);
+	dmap->start = dmap->end = 0;
+	__dmap_add_to_free_pool(fc, dmap);
+}
+
+/*
+ * Free inode dmap entries whose range falls entirely inside [start, end].
+ * Does not take any locks. Caller must take care of any lock requirements.
+ * Lock ordering follows fuse_dax_free_one_mapping().
+ * inode->i_rwsem, fuse_inode->i_mmap_sem and fuse_inode->i_dmap_sem must be
+ * held exclusively, unless it is called from evict_inode() where no one else
+ * is accessing the inode.
+ */
+static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
+				      loff_t start, loff_t end)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap, *n;
+	int err, num = 0;
+	LIST_HEAD(to_remove);
+
+	pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
+
+	/*
+	 * Interval tree search matches intersecting entries. Adjust the range
+	 * to avoid dropping partial valid entries.
+	 */
+	start = ALIGN(start, FUSE_DAX_MEM_RANGE_SZ);
+	end = ALIGN_DOWN(end, FUSE_DAX_MEM_RANGE_SZ);
+
+	while (1) {
+		dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, start,
+							 end);
+		if (!dmap)
+			break;
+		fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+		num++;
+		list_add(&dmap->list, &to_remove);
+	}
+
+	/* Nothing to remove */
+	if (list_empty(&to_remove))
+		return;
+
+	WARN_ON(fi->nr_dmaps < num);
+	fi->nr_dmaps -= num;
+	/*
+	 * During umount/shutdown, fuse connection is dropped first
+	 * and evict_inode() is called later. That means any
+	 * removemapping messages are going to fail. Send messages
+	 * only if connection is up. Otherwise fuse daemon is
+	 * responsible for cleaning up any leftover references and
+	 * mappings.
+	 */
+	if (fc->connected) {
+		err = dmap_removemapping_list(inode, num, &to_remove);
+		if (err) {
+			pr_warn("Failed to removemappings. start=0x%llx"
+				" end=0x%llx\n", start, end);
+		}
+	}
+	spin_lock(&fc->lock);
+	list_for_each_entry_safe(dmap, n, &to_remove, list) {
+		list_del_init(&dmap->list);
+		dmap_reinit_add_to_free_pool(fc, dmap);
+	}
+	spin_unlock(&fc->lock);
+}
+
+/*
+ * It is called from evict_inode() and by that time inode is going away. So
+ * this function does not take any locks like fi->i_dmap_sem for traversing
+ * that fuse inode interval tree. If that lock is taken then lock validator
+ * complains of deadlock situation w.r.t fs_reclaim lock.
+ */
+void fuse_cleanup_inode_mappings(struct inode *inode)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	/*
+	 * fuse_evict_inode() has already called truncate_inode_pages_final()
+	 * before we arrive here. So we should not have to worry about
+	 * any pages/exception entries still associated with inode.
+	 */
+	inode_reclaim_dmap_range(fc, inode, 0, -1);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1481,32 +1729,364 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return res;
 }
 
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
+	struct inode *inode = file->f_mapping->host;
 
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
-		return fuse_cache_read_iter(iocb, to);
-	else
+	if (IS_DAX(inode))
+		return fuse_dax_read_iter(iocb, to);
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
 		return fuse_direct_read_iter(iocb, to);
+
+	return fuse_cache_read_iter(iocb, to);
 }
 
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
 static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
+	struct inode *inode = file->f_mapping->host;
 
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
-		return fuse_cache_write_iter(iocb, from);
-	else
+	if (IS_DAX(inode))
+		return fuse_dax_write_iter(iocb, from);
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
 		return fuse_direct_write_iter(iocb, from);
+
+	return fuse_cache_write_iter(iocb, from);
+}
+
+static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
+{
+	iomap->addr = IOMAP_NULL_ADDR;
+	iomap->length = length;
+	iomap->type = IOMAP_HOLE;
+}
+
+static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
+			struct iomap *iomap, struct fuse_dax_mapping *dmap,
+			unsigned flags)
+{
+	loff_t offset, len;
+	loff_t i_size = i_size_read(inode);
+
+	offset = pos - dmap->start;
+	len = min(length, dmap->length - offset);
+
+	/* If length is beyond end of file, truncate further */
+	if (pos + len > i_size)
+		len = i_size - pos;
+
+	if (len > 0) {
+		iomap->addr = dmap->window_offset + offset;
+		iomap->length = len;
+		if (flags & IOMAP_FAULT)
+			iomap->length = ALIGN(len, PAGE_SIZE);
+		iomap->type = IOMAP_MAPPED;
+		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+				" length 0x%llx\n", __func__, iomap->addr,
+				iomap->offset, iomap->length);
+	} else {
+		/* Mapping beyond end of file is hole */
+		fuse_fill_iomap_hole(iomap, length);
+		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+				" length 0x%llx\n", __func__, iomap->addr,
+				iomap->offset, iomap->length);
+	}
+}
+
+static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
+					 loff_t length, unsigned flags,
+					 struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
+	int ret;
+	bool writable = flags & IOMAP_WRITE;
+
+	alloc_dmap = alloc_dax_mapping(fc);
+	if (!alloc_dmap)
+		return -EBUSY;
+
+	/*
+	 * Take write lock so that only one caller can try to setup mapping
+	 * and other waits.
+	 */
+	down_write(&fi->i_dmap_sem);
+	/*
+	 * We dropped lock. Check again if somebody else setup
+	 * mapping already.
+	 */
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
+						pos);
+	if (dmap) {
+		fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+		dmap_add_to_free_pool(fc, alloc_dmap);
+		up_write(&fi->i_dmap_sem);
+		return 0;
+	}
+
+	/* Setup one mapping */
+	ret = fuse_setup_one_mapping(inode,
+				     ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+				     alloc_dmap, writable, false);
+	if (ret < 0) {
+		printk("fuse_setup_one_mapping() failed. err=%d"
+			" pos=0x%llx, writable=%d\n", ret, pos, writable);
+		dmap_add_to_free_pool(fc, alloc_dmap);
+		up_write(&fi->i_dmap_sem);
+		return ret;
+	}
+	fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
+	up_write(&fi->i_dmap_sem);
+	return 0;
+}
+
+static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
+					 loff_t length, unsigned flags,
+					 struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	int ret;
+
+	/*
+	 * Take exclusive lock so that only one caller can try to setup
+	 * mapping and others wait.
+	 */
+	down_write(&fi->i_dmap_sem);
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
+
+	/* We are holding either inode lock or i_mmap_sem, and that should
+	 * ensure that dmap can't be reclaimed or truncated and it should still
+	 * be there in tree despite the fact we dropped and re-acquired the
+	 * lock.
+	 */
+	ret = -EIO;
+	if (WARN_ON(!dmap))
+		goto out_err;
+
+	/* Maybe another thread already upgraded mapping while we were not
+	 * holding lock.
+	 */
+	if (dmap->writable)
+		goto out_fill_iomap;
+
+	ret = fuse_setup_one_mapping(inode,
+				     ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+				     dmap, true, true);
+	if (ret < 0) {
+		printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
+		       ret, pos);
+		goto out_err;
+	}
+
+out_fill_iomap:
+	fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+out_err:
+	up_write(&fi->i_dmap_sem);
+	return ret;
+}
+
+/* This is just for DAX and the mapping is ephemeral, do not use it for other
+ * purposes since there is no block device with a permanent mapping.
+ */
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+			    unsigned flags, struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_dax_mapping *dmap;
+	bool writable = flags & IOMAP_WRITE;
+
+	/* We don't support FIEMAP */
+	BUG_ON(flags & IOMAP_REPORT);
+
+	pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
+			pos, length);
+
+	/*
+	 * Writes beyond end of file are not handled using dax path. Instead
+	 * a fuse write message is sent to daemon
+	 */
+	if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
+		return -EIO;
+
+	iomap->offset = pos;
+	iomap->flags = 0;
+	iomap->bdev = NULL;
+	iomap->dax_dev = fc->dax_dev;
+
+	/*
+	 * Both read/write and mmap path can race here. So we need something
+	 * to make sure if we are setting up mapping, then other path waits
+	 *
+	 * For now, use a semaphore for this. It probably needs to be
+	 * optimized later.
+	 */
+	down_read(&fi->i_dmap_sem);
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
+
+	if (dmap) {
+		if (writable && !dmap->writable) {
+			/* Upgrade read-only mapping to read-write. This will
+			 * require exclusive i_dmap_sem lock as we don't want
+			 * two threads to be trying to do this simultaneously
+			 * for same dmap. So drop shared lock and acquire
+			 * exclusive lock.
+			 */
+			up_read(&fi->i_dmap_sem);
+			pr_debug("%s: Upgrading mapping at offset 0x%llx"
+				 " length 0x%llx\n", __func__, pos, length);
+			return iomap_begin_upgrade_mapping(inode, pos, length,
+							   flags, iomap);
+		} else {
+			fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+			up_read(&fi->i_dmap_sem);
+			return 0;
+		}
+	} else {
+		up_read(&fi->i_dmap_sem);
+		pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
+				__func__, pos, length);
+		if (pos >= i_size_read(inode))
+			goto iomap_hole;
+
+		return iomap_begin_setup_new_mapping(inode, pos, length, flags,
+						     iomap);
+	}
+
+	/*
+	 * If read beyond end of file happens, fs code seems to return
+	 * it as hole
+	 */
+iomap_hole:
+	fuse_fill_iomap_hole(iomap, length);
+	pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
+	return 0;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+			  ssize_t written, unsigned flags,
+			  struct iomap *iomap)
+{
+	/* DAX writes beyond end-of-file aren't handled using iomap, so the
+	 * file size is unchanged and there is nothing to do here.
+	 */
+	return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+	.iomap_begin = fuse_iomap_begin,
+	.iomap_end = fuse_iomap_end,
+};
+
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock_shared(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock_shared(inode);
+	}
+
+	ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
+	inode_unlock_shared(inode);
+
+	/* TODO file_accessed(iocb->ki_filp) */
+
+	return ret;
+}
+
+static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return (iov_iter_rw(from) == WRITE &&
+		((iocb->ki_pos) >= i_size_read(inode)));
+}
+
+static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
+	ssize_t ret;
+
+	ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+	if (ret < 0)
+		return ret;
+
+	fuse_invalidate_attr(inode);
+	fuse_write_update_size(inode, iocb->ki_pos);
+	return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret, count;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	ret = file_remove_privs(iocb->ki_filp);
+	if (ret)
+		goto out;
+	/* TODO file_update_time() but we don't want metadata I/O */
+
+	/* Do not use dax for file extending writes as it's an mmap and
+	 * trying to write beyond end of existing page will generate
+	 * SIGBUS.
+	 */
+	if (file_extending_write(iocb, from)) {
+		ret = fuse_dax_direct_write(iocb, from);
+		goto out;
+	}
+
+	ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * If part of the write was file extending, fuse dax path will not
+	 * take care of that. Do direct write instead.
+	 */
+	if (iov_iter_count(from) && file_extending_write(iocb, from)) {
+		count = fuse_dax_direct_write(iocb, from);
+		if (count < 0)
+			goto out;
+		ret += count;
+	}
+
+out:
+	inode_unlock(inode);
+
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
 }
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
@@ -2185,6 +2765,11 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
 
 }
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	return -EINVAL; /* TODO */
+}
+
 static int convert_fuse_file_lock(struct fuse_conn *fc,
 				  const struct fuse_file_lock *ffl,
 				  struct file_lock *fl)
@@ -3266,6 +3851,7 @@ static const struct address_space_operations fuse_file_aops  = {
 void fuse_init_file_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
@@ -3275,4 +3861,9 @@ void fuse_init_file_inode(struct inode *inode)
 	fi->writectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	INIT_LIST_HEAD(&fi->writepages);
+	fi->dmap_tree = RB_ROOT_CACHED;
+
+	if (fc->dax_dev) {
+		inode->i_flags |= S_DAX;
+	}
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b020a4071f80..37b31c5435ff 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -104,16 +104,29 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->end)
+
 /** Translation information for file offsets to DAX window offsets */
 struct fuse_dax_mapping {
 	/* Will connect in fc->free_ranges to keep track of free memory */
 	struct list_head list;
 
+	/* For interval tree in file/inode */
+	struct rb_node rb;
+	/** Start Position in file */
+	__u64 start;
+	/** End Position in file */
+	__u64 end;
+	__u64 __subtree_last;
 	/** Position in DAX window */
 	u64 window_offset;
 
 	/** Length of mapping, in bytes */
 	loff_t length;
+
+	/* Is this mapping read-only or read-write */
+	bool writable;
 };
 
 /** FUSE inode */
@@ -201,6 +214,15 @@ struct fuse_inode {
 
 	/** Lock to protect write related fields */
 	spinlock_t lock;
+
+	/*
+	 * Semaphore to protect modifications to dmap_tree
+	 */
+	struct rw_semaphore i_dmap_sem;
+
+	/** Sorted rb tree of struct fuse_dax_mapping elements */
+	struct rb_root_cached dmap_tree;
+	unsigned long nr_dmaps;
 };
 
 /** FUSE inode state bits */
@@ -1242,5 +1264,6 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
 void fuse_free_conn(struct fuse_conn *fc);
+void fuse_cleanup_inode_mappings(struct inode *inode);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d5d134a01117..7e0ed5f3f7e6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -81,7 +81,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->attr_version = 0;
 	fi->orig_ino = 0;
 	fi->state = 0;
+	fi->nr_dmaps = 0;
 	mutex_init(&fi->mutex);
+	init_rwsem(&fi->i_dmap_sem);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
 	if (!fi->forget) {
@@ -109,6 +111,10 @@ static void fuse_evict_inode(struct inode *inode)
 	clear_inode(inode);
 	if (inode->i_sb->s_flags & SB_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
+		if (IS_DAX(inode)) {
+			fuse_cleanup_inode_mappings(inode);
+			WARN_ON(fi->nr_dmaps);
+		}
 		fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
 		fi->forget = NULL;
 	}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 7c2ad3d418df..ac23f57d8fd6 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -854,6 +854,7 @@ struct fuse_copy_file_range_in {
 
 #define FUSE_SETUPMAPPING_ENTRIES 8
 #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
 struct fuse_setupmapping_in {
 	/* An already open handle */
 	uint64_t	fh;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 12/19] fuse, dax: add DAX mmap support
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (10 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 11/19] fuse, dax: Implement dax read/write operations Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 13/19] fuse: Define dax address space operations Vivek Goyal
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

From: Stefan Hajnoczi <stefanha@redhat.com>

Add DAX mmap() support.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/file.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f323b7b04414..32870bb862e7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2730,10 +2730,15 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 	.page_mkwrite	= fuse_page_mkwrite,
 };
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct fuse_file *ff = file->private_data;
 
+	/* DAX mmap is superior to direct_io mmap */
+	if (IS_DAX(file_inode(file)))
+		return fuse_dax_mmap(file, vma);
+
 	if (ff->open_flags & FOPEN_DIRECT_IO) {
 		/* Can't provide the coherency needed for MAP_SHARED */
 		if (vma->vm_flags & VM_MAYSHARE)
@@ -2765,9 +2770,65 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
 
 }
 
+static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
+				   enum page_entry_size pe_size, bool write)
+{
+	vm_fault_t ret;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	pfn_t pfn;
+
+	if (write)
+		sb_start_pagefault(sb);
+
+	/* TODO inode semaphore to protect faults vs truncate */
+
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+	if (write)
+		sb_end_pagefault(sb);
+
+	return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+				vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+			       enum page_entry_size pe_size)
+{
+	return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+	.fault		= fuse_dax_fault,
+	.huge_fault	= fuse_dax_huge_fault,
+	.page_mkwrite	= fuse_dax_page_mkwrite,
+	.pfn_mkwrite	= fuse_dax_pfn_mkwrite,
+};
+
 static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	return -EINVAL; /* TODO */
+	file_accessed(file);
+	vma->vm_ops = &fuse_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	return 0;
 }
 
 static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3825,6 +3886,7 @@ static const struct file_operations fuse_file_operations = {
 	.release	= fuse_release,
 	.fsync		= fuse_fsync,
 	.lock		= fuse_file_lock,
+	.get_unmapped_area = thp_get_unmapped_area,
 	.flock		= fuse_file_flock,
 	.splice_read	= fuse_file_splice_read,
 	.splice_write	= iter_file_splice_write,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 13/19] fuse: Define dax address space operations
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (11 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 12/19] fuse, dax: add DAX mmap support Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 14/19] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(), which in turn decides which pfns need to be
written back.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 32870bb862e7..b8bcf49f007f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2556,6 +2556,17 @@ static int fuse_writepages_fill(struct page *page,
 	return err;
 }
 
+static int fuse_dax_writepages(struct address_space *mapping,
+				struct writeback_control *wbc)
+{
+
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	return dax_writeback_mapping_range(mapping,
+		NULL, fc->dax_dev, wbc);
+}
+
 static int fuse_writepages(struct address_space *mapping,
 			   struct writeback_control *wbc)
 {
@@ -3910,6 +3921,13 @@ static const struct address_space_operations fuse_file_aops  = {
 	.write_end	= fuse_write_end,
 };
 
+static const struct address_space_operations fuse_dax_file_aops  = {
+	.writepages	= fuse_dax_writepages,
+	.direct_IO	= noop_direct_IO,
+	.set_page_dirty	= noop_set_page_dirty,
+	.invalidatepage	= noop_invalidatepage,
+};
+
 void fuse_init_file_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -3927,5 +3945,6 @@ void fuse_init_file_inode(struct inode *inode)
 
 	if (fc->dax_dev) {
 		inode->i_flags |= S_DAX;
+		inode->i_data.a_ops = &fuse_dax_file_aops;
 	}
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 14/19] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (12 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 13/19] fuse: Define dax address space operations Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 15/19] fuse: Maintain a list of busy elements Vivek Goyal
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

We need some kind of locking mechanism here. Normal file systems like
ext4 and xfs seem to take their own semaphore to protect against
truncate while a fault is going on.

We have an additional requirement to protect against fuse dax memory range
reclaim. When a range has been selected for reclaim, we need to make sure
no other read/write/fault can try to access that memory range while
reclaim is in progress. Once reclaim is complete, lock will be released
and read/write/fault will trigger allocation of fresh dax range.

Taking inode_lock() is not an option in fault path as lockdep complains
about circular dependencies. So define a new fuse_inode->i_mmap_sem.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/dir.c    |  2 ++
 fs/fuse/file.c   | 17 +++++++++++++----
 fs/fuse/fuse_i.h |  7 +++++++
 fs/fuse/inode.c  |  1 +
 4 files changed, 23 insertions(+), 4 deletions(-)
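
Note for readers (not part of the commit): the exclusion rule described
above can be modelled in plain userspace C. This is only a toy,
single-threaded sketch, not the kernel rw_semaphore. All model_* names are
made up for illustration. Faults play the role of shared holders, and
truncate, punch hole and range reclaim play the role of the exclusive
holder.

```c
#include <stdbool.h>

/* Toy, single-threaded model of a reader/writer semaphore. */
struct model_rwsem {
	int readers;	/* number of shared holders (page faults) */
	bool writer;	/* true while an exclusive holder exists */
};

/* Shared acquire: allowed unless an exclusive holder exists. */
static bool model_down_read_trylock(struct model_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void model_up_read(struct model_rwsem *sem)
{
	sem->readers--;
}

/* Exclusive acquire: allowed only when nobody holds the lock. */
static bool model_down_write_trylock(struct model_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void model_up_write(struct model_rwsem *sem)
{
	sem->writer = false;
}
```

In this model any number of faults can hold the lock together, while a
truncate or reclaim has to wait until every fault has dropped it, and no
new fault can start until the exclusive holder is done.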

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index fd8636e67ae9..84c0b638affb 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1559,8 +1559,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 	 */
 	if ((is_truncate || !is_wb) &&
 	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+		down_write(&fi->i_mmap_sem);
 		truncate_pagecache(inode, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
+		up_write(&fi->i_mmap_sem);
 	}
 
 	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b8bcf49f007f..7b70b5ea7f94 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2792,13 +2792,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	if (write)
 		sb_start_pagefault(sb);
 
-	/* TODO inode semaphore to protect faults vs truncate */
-
+	/*
+	 * We need to serialize against not only truncate but also against
+	 * fuse dax memory range reclaim. While a range is being reclaimed,
+	 * we do not want any read/write/mmap to make progress and try
+	 * to populate page cache or access memory we are trying to free.
+	 */
+	down_read(&get_fuse_inode(inode)->i_mmap_sem);
 	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 
+	up_read(&get_fuse_inode(inode)->i_mmap_sem);
+
 	if (write)
 		sb_end_pagefault(sb);
 
@@ -3767,9 +3774,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 			file_update_time(file);
 	}
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		down_write(&fi->i_mmap_sem);
 		truncate_pagecache_range(inode, offset, offset + length - 1);
-
+		up_write(&fi->i_mmap_sem);
+	}
 	fuse_invalidate_attr(inode);
 
 out:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 37b31c5435ff..125bb7123651 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -220,6 +220,13 @@ struct fuse_inode {
 	 */
 	struct rw_semaphore i_dmap_sem;
 
+	/**
+	 * Can't take inode lock in fault path (leads to circular dependency).
+	 * So take this in fuse dax fault path to make sure truncate and
+	 * punch hole etc. can't make progress in parallel.
+	 */
+	struct rw_semaphore i_mmap_sem;
+
 	/** Sorted rb tree of struct fuse_dax_mapping elements */
 	struct rb_root_cached dmap_tree;
 	unsigned long nr_dmaps;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7e0ed5f3f7e6..52135b4616d2 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -83,6 +83,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->state = 0;
 	fi->nr_dmaps = 0;
 	mutex_init(&fi->mutex);
+	init_rwsem(&fi->i_mmap_sem);
 	init_rwsem(&fi->i_dmap_sem);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 15/19] fuse: Maintain a list of busy elements
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (13 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 14/19] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 16/19] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

This list will be used to select a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c   | 22 ++++++++++++++++++++++
 fs/fuse/fuse_i.h |  8 ++++++++
 fs/fuse/inode.c  |  4 ++++
 3 files changed, 34 insertions(+)
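
Note for readers (not part of the commit): a rough userspace sketch of the
bookkeeping this patch adds, assuming a circular doubly-linked list like
the kernel's list_head. The model_* types and helpers are hypothetical
stand-ins, and the fc->lock locking is left out. New mappings go to the
tail, so the head is always the oldest busy range. Whether reclaim picks
its victim from the head is decided by later patches and is not shown here.

```c
#include <stddef.h>

/* Minimal circular doubly-linked list node, in the spirit of list_head. */
struct model_list {
	struct model_list *prev, *next;
};

struct model_range {
	struct model_list busy_list;	/* links into conn->busy_ranges */
	unsigned long long window_offset;
};

struct model_conn {
	struct model_list busy_ranges;	/* list head */
	unsigned long nr_busy_ranges;
};

static void model_list_init(struct model_list *n)
{
	n->prev = n->next = n;
}

static int model_list_empty(const struct model_list *h)
{
	return h->next == h;
}

/* Append a range at the tail and bump the counter. */
static void model_busy_add(struct model_conn *c, struct model_range *r)
{
	struct model_list *n = &r->busy_list, *h = &c->busy_ranges;

	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
	c->nr_busy_ranges++;
}

/* Unlink a range, leave its node self-linked, and drop the counter. */
static void model_busy_remove(struct model_conn *c, struct model_range *r)
{
	struct model_list *n = &r->busy_list;

	n->prev->next = n->next;
	n->next->prev = n->prev;
	model_list_init(n);
	c->nr_busy_ranges--;
}

/* Oldest busy range, or NULL if the list is empty. */
static struct model_range *model_busy_first(struct model_conn *c)
{
	if (model_list_empty(&c->busy_ranges))
		return NULL;
	return (struct model_range *)((char *)c->busy_ranges.next -
				      offsetof(struct model_range, busy_list));
}
```

Re-initialising the node on removal is what makes a later emptiness check
on the node itself safe, which is the same reason the patch uses
list_del_init() and INIT_LIST_HEAD() on busy_list.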

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7b70b5ea7f94..8c1777fb61f7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -200,6 +200,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+				    struct fuse_dax_mapping *dmap)
+{
+	list_del_init(&dmap->busy_list);
+	WARN_ON(fc->nr_busy_ranges == 0);
+	fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+				  struct fuse_dax_mapping *dmap)
+{
+	spin_lock(&fc->lock);
+	__dmap_remove_busy_list(fc, dmap);
+	spin_unlock(&fc->lock);
+}
+
 /* This assumes fc->lock is held */
 static void __dmap_add_to_free_pool(struct fuse_conn *fc,
 				struct fuse_dax_mapping *dmap)
@@ -265,6 +282,10 @@ static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
 		/* Protected by fi->i_dmap_sem */
 		fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
 		fi->nr_dmaps++;
+		spin_lock(&fc->lock);
+		list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+		fc->nr_busy_ranges++;
+		spin_unlock(&fc->lock);
 	}
 	return 0;
 }
@@ -334,6 +355,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
 	pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
 		 "window_offset=0x%llx length=0x%llx\n", dmap->start,
 		 dmap->end, dmap->window_offset, dmap->length);
+	__dmap_remove_busy_list(fc, dmap);
 	dmap->start = dmap->end = 0;
 	__dmap_add_to_free_pool(fc, dmap);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 125bb7123651..070a5c2b6498 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -119,6 +119,10 @@ struct fuse_dax_mapping {
 	/** End Position in file */
 	__u64 end;
 	__u64 __subtree_last;
+
+	/* Will connect in fc->busy_ranges to keep track of busy memory */
+	struct list_head busy_list;
+
 	/** Position in DAX window */
 	u64 window_offset;
 
@@ -887,6 +891,10 @@ struct fuse_conn {
 	/** DAX device, non-NULL if DAX is supported */
 	struct dax_device *dax_dev;
 
+	/* List of memory ranges which are busy */
+	unsigned long nr_busy_ranges;
+	struct list_head busy_ranges;
+
 	/*
 	 * DAX Window Free Ranges. TODO: This might not be best place to store
 	 * this free list
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 52135b4616d2..b80e76a307f3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -614,6 +614,8 @@ static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
 	/* Free All allocated elements */
 	list_for_each_entry_safe(range, temp, mem_list, list) {
 		list_del(&range->list);
+		if (!list_empty(&range->busy_list))
+			list_del(&range->busy_list);
 		kfree(range);
 	}
 }
@@ -658,6 +660,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 		 */
 		range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
 		range->length = FUSE_DAX_MEM_RANGE_SZ;
+		INIT_LIST_HEAD(&range->busy_list);
 		list_add_tail(&range->list, &mem_ranges);
 	}
 
@@ -708,6 +711,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	INIT_LIST_HEAD(&fc->free_ranges);
+	INIT_LIST_HEAD(&fc->busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 16/19] dax: Create a range version of dax_layout_busy_page()
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (14 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 15/19] fuse: Maintain a list of busy elements Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 17/19] fuse: Add logic to free up a memory range Vivek Goyal
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, Dan Williams

While reclaiming a dax range, we do not want to unmap the whole file;
instead we want to make sure that pages in a certain range do not have
references taken on them. Hence create a version of the function which
allows a range to be passed in.

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            | 66 ++++++++++++++++++++++++++++++++-------------
 include/linux/dax.h |  6 +++++
 2 files changed, 54 insertions(+), 18 deletions(-)
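
Note for readers (not part of the commit): the length calculation in the
new function is easier to follow with numbers. The sketch below is a
userspace model with made-up model_* names and a 4096-byte page. It assumes
that unmap_mapping_range() turns its byte length into a page count by
rounding up to whole pages, counted from the page that contains the start
offset. That is my reading of mm/memory.c and should be treated as an
assumption.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Pages covered by a byte length that is rounded up to whole pages. */
static uint64_t model_pages_covered(uint64_t holelen)
{
	return (holelen + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}

/* Naive length: counted from the unaligned start offset. */
static uint64_t model_len_from_start(uint64_t start, uint64_t end)
{
	return end - start + 1;
}

/* Length as the patch computes it: counted from the rounded-down start. */
static uint64_t model_len_from_lstart(uint64_t start, uint64_t end)
{
	uint64_t lstart = start & ~(MODEL_PAGE_SIZE - 1);

	return end - lstart + 1;
}
```

Take start = 4094 and end = 4097, a four-byte range that straddles pages 0
and 1. The naive length is 4 bytes, which rounds up to a single page, so
only page 0 would be unmapped. Counting from the rounded-down start gives
4098 bytes, which rounds up to two pages and covers both.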

diff --git a/fs/dax.c b/fs/dax.c
index 60620a37030c..435f5b67e828 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -557,27 +557,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
-/**
- * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
+/*
+ * Partial pages are included. If end is 0, pages in the range from start
+ * to end of the file are included.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+					loff_t start, loff_t end)
 {
-	XA_STATE(xas, &mapping->i_pages, 0);
 	void *entry;
 	unsigned int scanned = 0;
 	struct page *page = NULL;
+	pgoff_t start_idx = start >> PAGE_SHIFT;
+	pgoff_t end_idx = end >> PAGE_SHIFT;
+	XA_STATE(xas, &mapping->i_pages, start_idx);
+	loff_t len, lstart = round_down(start, PAGE_SIZE);
 
 	/*
 	 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -588,6 +581,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
 		return NULL;
 
+	/* If end == 0, all pages from start to end of file */
+	if (!end) {
+		end_idx = ULONG_MAX;
+		len = 0;
+	} else {
+		/* length is being calculated from lstart and not start.
+		 * This is due to behavior of unmap_mapping_range(). If
+		 * start is say 4094 and end is on 4093 then we want to
+		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
+		 * will unmap only page at idx 0. If we calculate len
+		 * from the rounded down start, this problem should not
+		 * happen.
+		 */
+		len = end - lstart + 1;
+	}
+
 	/*
 	 * If we race get_user_pages_fast() here either we'll see the
 	 * elevated page count in the iteration and wait, or
@@ -600,10 +609,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	 * guaranteed to either see new references or prevent new
 	 * references from being established.
 	 */
-	unmap_mapping_range(mapping, 0, 0, 0);
+	unmap_mapping_range(mapping, start, len, 0);
 
 	xas_lock_irq(&xas);
-	xas_for_each(&xas, entry, ULONG_MAX) {
+	xas_for_each(&xas, entry, end_idx) {
 		if (WARN_ON_ONCE(!xa_is_value(entry)))
 			continue;
 		if (unlikely(dax_is_locked(entry)))
@@ -624,6 +633,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	xas_unlock_irq(&xas);
 	return page;
 }
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return dax_layout_busy_page_range(mapping, 0, 0);
+}
 EXPORT_SYMBOL_GPL(dax_layout_busy_page);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e7f40108f2c9..3ef6686c080b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -145,6 +145,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
 #else
@@ -180,6 +181,11 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 	return NULL;
 }
 
+static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end)
+{
+	return NULL;
+}
+
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct dax_device *dax_dev,
 		struct writeback_control *wbc)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 17/19] fuse: Add logic to free up a memory range
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (15 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 16/19] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 18/19] fuse: Release file in process context Vivek Goyal
  2019-08-21 17:57 ` [PATCH 19/19] fuse: Take inode lock for dax inode truncation Vivek Goyal
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert, kbuild test robot, Liu Bo

Add logic to free up a busy memory range. Freed memory range will be
returned to free pool. Add a worker which can be started to select
and free some busy memory ranges.
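
As a rough userspace sketch of when the worker gets started (the helper name should_kick_reclaim() is hypothetical; the 20% value and the minimum of one range mirror FUSE_DAX_RECLAIM_THRESHOLD and the check in __kick_dmap_free_worker() in this patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Value taken from this patch: reclaim starts below 20% free ranges. */
#define FUSE_DAX_RECLAIM_THRESHOLD 20

/*
 * Hypothetical helper, not part of the patch. Returns true when the
 * number of free ranges has dropped below the reclaim threshold, which
 * is a percentage of all ranges but never less than one range.
 */
bool should_kick_reclaim(unsigned long nr_ranges, unsigned long nr_free_ranges)
{
	unsigned long free_threshold;

	free_threshold = nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD / 100;
	if (free_threshold < 1)
		free_threshold = 1;

	return nr_free_ranges < free_threshold;
}
```

With 100 ranges in total, the worker would be kicked once fewer than 20 are free. With only 3 ranges the threshold rounds down to 0 and is clamped to 1, so the worker is kicked only when the free pool is empty.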

In certain cases (write path), process can steal one of its busy
dax ranges (inline reclaim) if free range is not available.

If no free range is available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.
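
The fallback order described above can be modelled roughly as follows. This is only a userspace sketch: the enum and pick_alloc_action() are hypothetical names, and the fault/read case reflects the -ENOSPC retry path in this patch, where the caller drops its locks, waits and retries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum dax_alloc_action {
	DAX_ALLOC_TAKE_FREE,		/* a free range exists, use it */
	DAX_ALLOC_STEAL_FROM_INODE,	/* inline reclaim of one of our own ranges */
	DAX_ALLOC_WAIT_FOR_FREE,	/* sleep on the waitq for a free range */
	DAX_ALLOC_DROP_LOCKS_AND_RETRY,	/* fault/read path: fail, wait, retry */
};

/*
 * Decide what an allocation attempt does, given the number of free
 * ranges on the connection, the number of ranges the inode already
 * holds, and the kind of access that needs the range.
 */
enum dax_alloc_action pick_alloc_action(long nr_free_ranges,
					unsigned long inode_nr_dmaps,
					bool is_fault, bool is_write)
{
	if (nr_free_ranges > 0)
		return DAX_ALLOC_TAKE_FREE;

	/* Inline reclaim is only attempted in the non-fault write path. */
	if (is_fault || !is_write)
		return DAX_ALLOC_DROP_LOCKS_AND_RETRY;

	if (inode_nr_dmaps > 0)
		return DAX_ALLOC_STEAL_FROM_INODE;

	return DAX_ALLOC_WAIT_FOR_FREE;
}
```

For example, a write to an inode that already owns three ranges steals one of them when the free pool is empty, while a page fault on the same inode has to back off and retry.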

For reclaiming a range, as of now we need to hold following locks in
specified order.

	down_write(&fi->i_mmap_sem);
	down_write(&fi->i_dmap_sem);


Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c      | 488 +++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h    |  25 +++
 fs/fuse/inode.c     |   5 +
 fs/fuse/virtio_fs.c |  10 +
 4 files changed, 519 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8c1777fb61f7..2ff7624d58c0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -25,6 +25,8 @@
 INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
                      START, LAST, static inline, fuse_dax_interval_tree);
 
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+				struct inode *inode);
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
 {
@@ -177,6 +179,28 @@ static void fuse_link_write_file(struct file *file)
 	spin_unlock(&fi->lock);
 }
 
+static void
+__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+	unsigned long free_threshold;
+
+	/* If number of free ranges are below threshold, start reclaim */
+	free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+				(unsigned long)1);
+	if (fc->nr_free_ranges < free_threshold) {
+		pr_debug("fuse: Kicking dax memory reclaim worker. nr_free_ranges=%ld nr_total_ranges=%lu\n", fc->nr_free_ranges, fc->nr_ranges);
+		queue_delayed_work(system_long_wq, &fc->dax_free_work,
+				   msecs_to_jiffies(delay_ms));
+	}
+}
+
+static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+	spin_lock(&fc->lock);
+	__kick_dmap_free_worker(fc, delay_ms);
+	spin_unlock(&fc->lock);
+}
+
 static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 {
 	struct fuse_dax_mapping *dmap = NULL;
@@ -186,7 +210,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	/* TODO: Add logic to try to free up memory if wait is allowed */
 	if (fc->nr_free_ranges <= 0) {
 		spin_unlock(&fc->lock);
-		return NULL;
+		goto out_kick;
 	}
 
 	WARN_ON(list_empty(&fc->free_ranges));
@@ -197,6 +221,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	list_del_init(&dmap->list);
 	fc->nr_free_ranges--;
 	spin_unlock(&fc->lock);
+
+out_kick:
+	kick_dmap_free_worker(fc, 0);
 	return dmap;
 }
 
@@ -223,6 +250,8 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
 {
 	list_add_tail(&dmap->list, &fc->free_ranges);
 	fc->nr_free_ranges++;
+	/* TODO: Wake up only when needed */
+	wake_up(&fc->dax_range_waitq);
 }
 
 static void dmap_add_to_free_pool(struct fuse_conn *fc,
@@ -274,9 +303,15 @@ static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
 
 	dmap->writable = writable;
 	if (!upgrade) {
-		/* TODO: What locking is required here. For now,
-		 * using fc->lock
+		/*
+		 * We don't take a reference on inode. inode is valid right now
+		 * and when inode is going away, cleanup logic should first
+		 * cleanup dmap entries.
+		 *
+		 * TODO: Do we need to ensure that we are holding inode lock
+		 * as well.
 		 */
+		dmap->inode = inode;
 		dmap->start = offset;
 		dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
 		/* Protected by fi->i_dmap_sem */
@@ -356,6 +391,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
 		 "window_offset=0x%llx length=0x%llx\n", dmap->start,
 		 dmap->end, dmap->window_offset, dmap->length);
 	__dmap_remove_busy_list(fc, dmap);
+	dmap->inode = NULL;
 	dmap->start = dmap->end = 0;
 	__dmap_add_to_free_pool(fc, dmap);
 }
@@ -424,6 +460,21 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
 	spin_unlock(&fc->lock);
 }
 
+static int dmap_removemapping_one(struct inode *inode,
+				  struct fuse_dax_mapping *dmap)
+{
+	struct fuse_removemapping_one forget_one;
+	struct fuse_removemapping_in inarg;
+
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.count = 1;
+	memset(&forget_one, 0, sizeof(forget_one));
+	forget_one.moffset = dmap->window_offset;
+	forget_one.len = dmap->length;
+
+	return fuse_send_removemapping(inode, &inarg, &forget_one);
+}
+
 /*
  * It is called from evict_inode() and by that time inode is going away. So
  * this function does not take any locks like fi->i_dmap_sem for traversing
@@ -1816,6 +1867,18 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
 		if (flags & IOMAP_FAULT)
 			iomap->length = ALIGN(len, PAGE_SIZE);
 		iomap->type = IOMAP_MAPPED;
+
+		/*
+		 * Increase refcnt so that reclaim code knows this dmap is in
+		 * use. This assumes i_dmap_sem is held, either shared or
+		 * exclusive.
+		 */
+		refcount_inc(&dmap->refcnt);
+
+		/* iomap->private should be NULL */
+		WARN_ON_ONCE(iomap->private);
+		iomap->private = dmap;
+
 		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
 				" length 0x%llx\n", __func__, iomap->addr,
 				iomap->offset, iomap->length);
@@ -1838,8 +1901,23 @@ static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
 	int ret;
 	bool writable = flags & IOMAP_WRITE;
 
-	alloc_dmap = alloc_dax_mapping(fc);
-	if (!alloc_dmap)
+	/* Can't do reclaim in fault path yet due to lock ordering.
+	 * Read path takes shared inode lock and that's not sufficient
+	 * for inline range reclaim. Caller needs to drop lock, wait
+	 * and retry.
+	 */
+	if (flags & IOMAP_FAULT || !(flags & IOMAP_WRITE)) {
+		alloc_dmap = alloc_dax_mapping(fc);
+		if (!alloc_dmap)
+			return -ENOSPC;
+	} else {
+		alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+		if (IS_ERR(alloc_dmap))
+			return PTR_ERR(alloc_dmap);
+	}
+
+	/* If we are here, we should have memory allocated */
+	if (WARN_ON(!alloc_dmap))
 		return -EBUSY;
 
 	/*
@@ -1892,14 +1970,25 @@ static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
 	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
 
 	/* We are holding either inode lock or i_mmap_sem, and that should
-	 * ensure that dmap can't reclaimed or truncated and it should still
-	 * be there in tree despite the fact we dropped and re-acquired the
-	 * lock.
+	 * ensure that dmap can't be truncated. We are holding a reference
+	 * on dmap and that should make sure it can't be reclaimed. So dmap
+	 * should still be there in tree despite the fact we dropped and
+	 * re-acquired the i_dmap_sem lock.
 	 */
 	ret = -EIO;
 	if (WARN_ON(!dmap))
 		goto out_err;
 
+	/* We took an extra reference on dmap to make sure it's not reclaimed.
+	 * Now we hold i_dmap_sem lock and that reference is not needed
+	 * anymore. Drop it.
+	 */
+	if (refcount_dec_and_test(&dmap->refcnt)) {
+		/* refcount should not hit 0. This object only goes
+		 * away when fuse connection goes away */
+		WARN_ON_ONCE(1);
+	}
+
 	/* Maybe another thread already upgraded mapping while we were not
 	 * holding lock.
 	 */
@@ -1968,7 +2057,11 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 			 * two threads to be trying to this simultaneously
 			 * for same dmap. So drop shared lock and acquire
 			 * exclusive lock.
+			 *
+			 * Before dropping i_dmap_sem lock, take reference
+			 * on dmap so that it's not freed by range reclaim.
 			 */
+			refcount_inc(&dmap->refcnt);
 			up_read(&fi->i_dmap_sem);
 			pr_debug("%s: Upgrading mapping at offset 0x%llx"
 				 " length 0x%llx\n", __func__, pos, length);
@@ -2004,6 +2097,16 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags,
 			  struct iomap *iomap)
 {
+	struct fuse_dax_mapping *dmap = iomap->private;
+
+	if (dmap) {
+		if (refcount_dec_and_test(&dmap->refcnt)) {
+			/* refcount should not hit 0. This object only goes
+			 * away when fuse connection goes away */
+			WARN_ON_ONCE(1);
+		}
+	}
+
 	/* DAX writes beyond end-of-file aren't handled using iomap, so the
 	 * file size is unchanged and there is nothing to do here.
 	 */
@@ -2018,7 +2121,18 @@ static const struct iomap_ops fuse_iomap_ops = {
 static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 	ssize_t ret;
+	bool retry = false;
+
+retry:
+	if (retry && !(fc->nr_free_ranges > 0)) {
+		ret = -EINTR;
+		if (wait_event_killable_exclusive(fc->dax_range_waitq,
+						  (fc->nr_free_ranges > 0))) {
+			goto out;
+		}
+	}
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock_shared(inode))
@@ -2030,8 +2144,19 @@ static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
 	inode_unlock_shared(inode);
 
+	/* If a dax range could not be allocated and it can't be reclaimed
+	 * inline, then drop inode lock and retry. Range reclaim logic
+	 * requires exclusive access to inode lock.
+	 *
+	 * TODO: What if -ENOSPC needs to be returned to user space. Fix it.
+	 */
+	if (ret == -ENOSPC) {
+		retry = true;
+		goto retry;
+	}
 	/* TODO file_accessed(iocb->f_filp) */
 
+out:
 	return ret;
 }
 
@@ -2810,10 +2935,21 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	pfn_t pfn;
+	int error = 0;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	bool retry = false;
 
 	if (write)
 		sb_start_pagefault(sb);
 
+retry:
+	if (retry && !(fc->nr_free_ranges > 0)) {
+		ret = -EINTR;
+		if (wait_event_killable_exclusive(fc->dax_range_waitq,
+					(fc->nr_free_ranges > 0)))
+			goto out;
+	}
+
 	/*
 	 * We need to serialize against not only truncate but also against
 	 * fuse dax memory range reclaim. While a range is being reclaimed,
@@ -2821,13 +2957,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	 * to populate page cache or access memory we are trying to free.
 	 */
 	down_read(&get_fuse_inode(inode)->i_mmap_sem);
-	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops);
+	if ((ret & VM_FAULT_ERROR) && error == -ENOSPC) {
+		error = 0;
+		retry = true;
+		up_read(&get_fuse_inode(inode)->i_mmap_sem);
+		goto retry;
+	}
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 
 	up_read(&get_fuse_inode(inode)->i_mmap_sem);
 
+out:
 	if (write)
 		sb_end_pagefault(sb);
 
@@ -3979,3 +4122,330 @@ void fuse_init_file_inode(struct inode *inode)
 		inode->i_data.a_ops = &fuse_dax_file_aops;
 	}
 }
+
+static int dmap_writeback_invalidate(struct inode *inode,
+				     struct fuse_dax_mapping *dmap)
+{
+	int ret;
+
+	ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start,
+				       dmap->end);
+	if (ret) {
+		printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
+			" end=0x%llx\n", ret, dmap->start, dmap->end);
+		return ret;
+	}
+
+	ret = invalidate_inode_pages2_range(inode->i_mapping,
+					    dmap->start >> PAGE_SHIFT,
+					    dmap->end >> PAGE_SHIFT);
+	if (ret)
+		printk("invalidate_inode_pages2_range() failed err=%d\n", ret);
+
+	return ret;
+}
+
+static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
+				   struct fuse_dax_mapping *dmap)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	/*
+	 * igrab() was done to make sure inode won't go under us, and this
+	 * further avoids the race with evict().
+	 */
+	ret = dmap_writeback_invalidate(inode, dmap);
+
+	/* TODO: What to do if above fails? For now,
+	 * leave the range in place.
+	 */
+	if (ret)
+		return ret;
+
+	/* Remove dax mapping from inode interval tree now */
+	fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+	fi->nr_dmaps--;
+
+	ret = dmap_removemapping_one(inode, dmap);
+	if (ret) {
+		pr_warn("Failed to remove mapping. offset=0x%llx len=0x%llx\n",
+			dmap->window_offset, dmap->length);
+	}
+
+	return 0;
+}
+
+static void fuse_wait_dax_page(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	up_write(&fi->i_mmap_sem);
+	schedule();
+	down_write(&fi->i_mmap_sem);
+}
+
+/* Should be called with fi->i_mmap_sem lock held exclusively */
+static int __fuse_break_dax_layouts(struct inode *inode, bool *retry,
+				    loff_t start, loff_t end)
+{
+	struct page *page;
+
+	page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+	if (!page)
+		return 0;
+
+	*retry = true;
+	return ___wait_var_event(&page->_refcount,
+			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+			0, 0, fuse_wait_dax_page(inode));
+}
+
+/* dmap_end == 0 leads to unmapping of whole file */
+static int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start,
+				  u64 dmap_end)
+{
+	bool	retry;
+	int	ret;
+
+	do {
+		retry = false;
+		ret = __fuse_break_dax_layouts(inode, &retry, dmap_start,
+					       dmap_end);
+	} while (ret == 0 && retry);
+
+	return ret;
+}
+
+/* Find first mapping in the tree and free it. */
+static struct fuse_dax_mapping *
+inode_reclaim_first_dmap_locked(struct fuse_conn *fc, struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	int ret;
+
+	/* Find the first fuse dax mapping of the inode. */
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
+	if (!dmap)
+		return NULL;
+
+	ret = reclaim_one_dmap_locked(fc, inode, dmap);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/* Clean up dmap. Do not add back to free list */
+	dmap_remove_busy_list(fc, dmap);
+	dmap->inode = NULL;
+	dmap->start = dmap->end = 0;
+
+	pr_debug("fuse: reclaimed memory range window_offset=0x%llx,"
+				" length=0x%llx\n", dmap->window_offset,
+				dmap->length);
+	return dmap;
+}
+
+/*
+ * Find first mapping in the tree, free it and return it. Do not add
+ * it back to free pool.
+ *
+ * This is called with inode lock held.
+ */
+static struct fuse_dax_mapping *inode_reclaim_first_dmap(struct fuse_conn *fc,
+							 struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	int ret;
+
+	down_write(&fi->i_mmap_sem);
+
+	/* Make sure there are no references to inode pages using
+	 * get_user_pages()
+	 *
+	 * TODO: Only check for page range inside dmap (and not whole inode)
+	 */
+	ret = fuse_break_dax_layouts(inode, 0, 0);
+	if (ret) {
+		printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
+		       ret);
+		dmap = ERR_PTR(ret);
+		goto out_mmap_sem;
+	}
+	down_write(&fi->i_dmap_sem);
+	dmap = inode_reclaim_first_dmap_locked(fc, inode);
+	up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+	up_write(&fi->i_mmap_sem);
+	return dmap;
+}
+
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+					struct inode *inode)
+{
+	struct fuse_dax_mapping *dmap;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	while(1) {
+		dmap = alloc_dax_mapping(fc);
+		if (dmap)
+			return dmap;
+
+		if (fi->nr_dmaps) {
+			dmap = inode_reclaim_first_dmap(fc, inode);
+			if (dmap)
+				return dmap;
+		}
+		/*
+		 * There are no mappings which can be reclaimed.
+		 * Wait for one.
+		 */
+		if (!(fc->nr_free_ranges > 0)) {
+			if (wait_event_killable_exclusive(fc->dax_range_waitq,
+					(fc->nr_free_ranges > 0)))
+				return ERR_PTR(-EINTR);
+		}
+	}
+}
+
+static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
+					  struct inode *inode, u64 dmap_start)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+
+	/* Find fuse dax mapping at file offset dmap_start. */
+	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
+						 dmap_start);
+
+	/* Range already got cleaned up by somebody else */
+	if (!dmap)
+		return 0;
+
+	/* still in use. */
+	if (refcount_read(&dmap->refcnt) > 1)
+		return 0;
+
+	ret = reclaim_one_dmap_locked(fc, inode, dmap);
+	if (ret < 0)
+		return ret;
+
+	/* Cleanup dmap entry and add back to free list */
+	spin_lock(&fc->lock);
+	dmap_reinit_add_to_free_pool(fc, dmap);
+	spin_unlock(&fc->lock);
+	return ret;
+}
+
+/*
+ * Free a range of memory.
+ * Locking.
+ * 1. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 2. Take fuse_inode->i_dmap_sem to protect interval tree and also to make
+ *    sure read/write can not reuse a dmap which we might be freeing.
+ */
+static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
+				   u64 dmap_start, u64 dmap_end)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	down_write(&fi->i_mmap_sem);
+	ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
+	if (ret) {
+		printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
+		       ret);
+		goto out_mmap_sem;
+	}
+
+	down_write(&fi->i_dmap_sem);
+	ret = lookup_and_reclaim_dmap_locked(fc, inode, dmap_start);
+	up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+	up_write(&fi->i_mmap_sem);
+	return ret;
+}
+
+static int try_to_free_dmap_chunks(struct fuse_conn *fc,
+				   unsigned long nr_to_free)
+{
+	struct fuse_dax_mapping *dmap, *pos, *temp;
+	int ret, nr_freed = 0;
+	u64 dmap_start = 0, window_offset = 0, dmap_end = 0;
+	struct inode *inode = NULL;
+
+	/* Pick first busy range and free it for now*/
+	while(1) {
+		if (nr_freed >= nr_to_free)
+			break;
+
+		dmap = NULL;
+		spin_lock(&fc->lock);
+
+		if (!fc->nr_busy_ranges) {
+			spin_unlock(&fc->lock);
+			return 0;
+		}
+
+		list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
+						busy_list) {
+			/* skip this range if it's in use. */
+			if (refcount_read(&pos->refcnt) > 1)
+				continue;
+
+			inode = igrab(pos->inode);
+			/*
+			 * This inode is going away. That will free
+			 * up all the ranges anyway, continue to
+			 * next range.
+			 */
+			if (!inode)
+				continue;
+			/*
+			 * Take this element off list and add it to the tail. If
+			 * inode lock can't be obtained, this will help with
+			 * selecting new element
+			 */
+			dmap = pos;
+			list_move_tail(&dmap->busy_list, &fc->busy_ranges);
+			dmap_start = dmap->start;
+			dmap_end = dmap->end;
+			window_offset = dmap->window_offset;
+			break;
+		}
+		spin_unlock(&fc->lock);
+		if (!dmap)
+			return 0;
+
+		ret = lookup_and_reclaim_dmap(fc, inode, dmap_start, dmap_end);
+		iput(inode);
+		if (ret) {
+			printk("%s(window_offset=0x%llx) failed. err=%d\n",
+				__func__, window_offset, ret);
+			return ret;
+		}
+		nr_freed++;
+	}
+	return 0;
+}
+
+/* TODO: This probably should go in inode.c */
+void fuse_dax_free_mem_worker(struct work_struct *work)
+{
+	int ret;
+	struct fuse_conn *fc = container_of(work, struct fuse_conn,
+						dax_free_work.work);
+	pr_debug("fuse: Worker to free memory called. nr_free_ranges=%ld"
+		 " nr_busy_ranges=%lu\n", fc->nr_free_ranges,
+		 fc->nr_busy_ranges);
+
+	ret = try_to_free_dmap_chunks(fc, FUSE_DAX_RECLAIM_CHUNK);
+	if (ret) {
+		pr_debug("fuse: try_to_free_dmap_chunks() failed with err=%d\n",
+			 ret);
+	}
+
+	/* If number of free ranges is still below threshold, requeue */
+	kick_dmap_free_worker(fc, 1);
+}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 070a5c2b6498..5f2f348536aa 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -57,6 +57,16 @@
 #define FUSE_DAX_MEM_RANGE_SZ	(2*1024*1024)
 #define FUSE_DAX_MEM_RANGE_PAGES	(FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
 
+/* Number of ranges reclaimer will try to free in one invocation */
+#define FUSE_DAX_RECLAIM_CHUNK		(10)
+
+/*
+ * Dax memory reclaim threshold in percentage of total ranges. When the
+ * number of free ranges drops below this threshold, reclaim can trigger.
+ * Default is 20%.
+ */
+#define FUSE_DAX_RECLAIM_THRESHOLD	(20)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -109,6 +119,9 @@ struct fuse_forget_link {
 
 /** Translation information for file offsets to DAX window offsets */
 struct fuse_dax_mapping {
+	/* Pointer to inode where this memory range is mapped */
+	struct inode *inode;
+
 	/* Will connect in fc->free_ranges to keep track of free memory */
 	struct list_head list;
 
@@ -131,6 +144,9 @@ struct fuse_dax_mapping {
 
 	/* Is this mapping read-only or read-write */
 	bool writable;
+
+	/* reference count when the mapping is used by dax iomap. */
+	refcount_t refcnt;
 };
 
 /** FUSE inode */
@@ -895,12 +911,20 @@ struct fuse_conn {
 	unsigned long nr_busy_ranges;
 	struct list_head busy_ranges;
 
+	/* Worker to free up memory ranges */
+	struct delayed_work dax_free_work;
+
+	/* Wait queue for a dax range to become free */
+	wait_queue_head_t dax_range_waitq;
+
 	/*
 	 * DAX Window Free Ranges. TODO: This might not be best place to store
 	 * this free list
 	 */
 	long nr_free_ranges;
 	struct list_head free_ranges;
+
+	unsigned long nr_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1279,6 +1303,7 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
 void fuse_free_conn(struct fuse_conn *fc);
+void fuse_dax_free_mem_worker(struct work_struct *work);
 void fuse_cleanup_inode_mappings(struct inode *inode);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b80e76a307f3..4871933f4557 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -661,11 +661,13 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 		range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
 		range->length = FUSE_DAX_MEM_RANGE_SZ;
 		INIT_LIST_HEAD(&range->busy_list);
+		refcount_set(&range->refcnt, 1);
 		list_add_tail(&range->list, &mem_ranges);
 	}
 
 	list_replace_init(&mem_ranges, &fc->free_ranges);
 	fc->nr_free_ranges = nr_ranges;
+	fc->nr_ranges = nr_ranges;
 	return 0;
 out_err:
 	/* Free All allocated elements */
@@ -692,6 +694,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	atomic_set(&fc->dev_count, 1);
 	init_waitqueue_head(&fc->blocked_waitq);
 	init_waitqueue_head(&fc->reserved_req_waitq);
+	init_waitqueue_head(&fc->dax_range_waitq);
 	fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
@@ -712,6 +715,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	INIT_LIST_HEAD(&fc->free_ranges);
 	INIT_LIST_HEAD(&fc->busy_ranges);
+	INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -720,6 +724,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 	if (refcount_dec_and_test(&fc->count)) {
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
+		flush_delayed_work(&fc->dax_free_work);
 		if (fc->dax_dev)
 			fuse_free_dax_mem_ranges(&fc->free_ranges);
 		put_pid_ns(fc->pid_ns);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9198c2b84677..72b97bcd8e44 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -491,6 +491,15 @@ static void virtio_fs_cleanup_dax(void *data)
 	put_dax(fs->dax_dev);
 }
 
+static void virtio_fs_pagemap_page_free(struct page *page)
+{
+	wake_up_var(&page->_refcount);
+}
+
+static const struct dev_pagemap_ops virtio_fs_pagemap_ops = {
+	.page_free	= virtio_fs_pagemap_page_free,
+};
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 {
 	struct virtio_shm_region cache_reg;
@@ -517,6 +526,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 		return -ENOMEM;
 
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->ops = &virtio_fs_pagemap_ops;
 
 	/* Ideally we would directly use the PCI BAR resource but
 	 * devm_memremap_pages() wants its own copy in pgmap.  So
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 18/19] fuse: Release file in process context
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (16 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 17/19] fuse: Add logic to free up a memory range Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  2019-08-21 17:57 ` [PATCH 19/19] fuse: Take inode lock for dax inode truncation Vivek Goyal
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

fuse_file_put(sync) can be called with sync=true/false. If sync=true,
it waits for release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for release request
response, frees the fuse_file struct immediately and req->end function
does the iput().

iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has already let go of its
reference), then iput() will clean up DAX mappings as well, send
REMOVEMAPPING requests and wait for completion, all in the context of
the worker thread which is processing fuse replies from the daemon on
the host.

That means it blocks the worker thread, which stops processing further
replies, and the system deadlocks.

So for now, force sync release of file in case of DAX inodes.
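
A minimal sketch of the resulting decision, assuming a hypothetical helper release_must_be_sync() that stands in for the check added to fuse_release_common() (the two inputs correspond to IS_DAX(inode) and fc->destroy_req != NULL):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch. A release is done
 * synchronously either for a fuseblk-style mount (destroy_req is set)
 * or for a DAX inode, so that iput() runs in the caller's context and
 * not in the worker that processes replies.
 */
bool release_must_be_sync(bool inode_is_dax, bool has_destroy_req)
{
	return inode_is_dax || has_destroy_req;
}
```

Only a non-DAX inode on a connection without destroy_req keeps the asynchronous release.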

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2ff7624d58c0..e369a1f92d85 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -579,6 +579,7 @@ void fuse_release_common(struct file *file, bool isdir)
 	struct fuse_file *ff = file->private_data;
 	struct fuse_req *req = ff->reserved_req;
 	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+	bool sync = false;
 
 	fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
@@ -599,8 +600,20 @@ void fuse_release_common(struct file *file, bool isdir)
 	 * Make the release synchronous if this is a fuseblk mount,
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
+	 *
+	 * For DAX, fuse server is trusted. So it should be fine to
+	 * do a sync file put. Doing async file put is creating
+	 * problems right now because when the request finishes, iput()
+	 * can lead to freeing of inode. That means it tears down
+	 * mappings backing DAX memory and sends REMOVEMAPPING message
+	 * to server and blocks for completion. Currently, waiting
+	 * in req->end context deadlocks the system as same worker thread
+	 * can't process REMOVEMAPPING reply it is waiting for.
 	 */
-	fuse_file_put(ff, ff->fc->destroy_req != NULL, isdir);
+	if (IS_DAX(req->misc.release.inode) || ff->fc->destroy_req != NULL)
+		sync = true;
+
+	fuse_file_put(ff, sync, isdir);
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 19/19] fuse: Take inode lock for dax inode truncation
  2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
                   ` (17 preceding siblings ...)
  2019-08-21 17:57 ` [PATCH 18/19] fuse: Release file in process context Vivek Goyal
@ 2019-08-21 17:57 ` Vivek Goyal
  18 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-21 17:57 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: virtio-fs, vgoyal, miklos, stefanha, dgilbert

When a file is opened with O_TRUNC, we need to make sure that no other
DAX operation is in progress. DAX expects i_size to be stable.

In fuse_iomap_begin() we check for i_size at multiple places and we expect
i_size to not change.

Another problem is, if we set up a mapping in fuse_iomap_begin(), and
the file gets truncated and a dax read/write happens, KVM currently hangs.
It tries to fault in a page which does not exist on the host (file got
truncated). It probably requires fixing in KVM.

So for now, take inode lock. Once KVM is fixed, we might have to
have a look at it again.
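
The changed condition can be sketched in userspace as below. The helper name open_needs_inode_lock() is hypothetical; its arguments stand in for file->f_flags, fc->atomic_o_trunc, fc->writeback_cache and IS_DAX(inode) from fuse_open_common().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch. The inode lock is taken
 * for an O_TRUNC open when the server does atomic O_TRUNC and the inode
 * either uses the writeback cache or, with this patch, is a DAX inode.
 */
bool open_needs_inode_lock(int f_flags, bool atomic_o_trunc,
			   bool writeback_cache, bool is_dax)
{
	return (f_flags & O_TRUNC) && atomic_o_trunc &&
	       (writeback_cache || is_dax);
}
```

So an O_TRUNC open of a DAX inode now takes the inode lock even when the writeback cache is off, while opens without O_TRUNC are unaffected.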

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e369a1f92d85..794c55131bd0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -524,7 +524,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
 	int err;
 	bool lock_inode = (file->f_flags & O_TRUNC) &&
 			  fc->atomic_o_trunc &&
-			  fc->writeback_cache;
+			  (fc->writeback_cache || IS_DAX(inode));
 
 	err = generic_file_open(inode, file);
 	if (err)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/19] fuse, dax: Implement dax read/write operations
  2019-08-21 17:57 ` [PATCH 11/19] fuse, dax: Implement dax read/write operations Vivek Goyal
@ 2019-08-21 19:49   ` Liu Bo
  2019-08-22 12:59     ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Liu Bo @ 2019-08-21 19:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Miklos Szeredi, Peng Tao

On Wed, Aug 21, 2019 at 01:57:12PM -0400, Vivek Goyal wrote:
> This patch implements basic DAX support. mmap() is not implemented
> yet and will come in later patches. This patch looks into implementing
> read/write.
> 
> We make use of interval tree to keep track of per inode dax mappings.
> 
> Do not use dax for file extending writes, instead just send WRITE message
> to daemon (like we do for direct I/O path). This will keep write and
> i_size change atomic w.r.t crash.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
> ---
>  fs/fuse/file.c            | 603 +++++++++++++++++++++++++++++++++++++-
>  fs/fuse/fuse_i.h          |  23 ++
>  fs/fuse/inode.c           |   6 +
>  include/uapi/linux/fuse.h |   1 +
>  4 files changed, 627 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index c45ffe6f1ecb..f323b7b04414 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -18,6 +18,12 @@
>  #include <linux/swap.h>
>  #include <linux/falloc.h>
>  #include <linux/uio.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/interval_tree_generic.h>
> +
> +INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
> +                     START, LAST, static inline, fuse_dax_interval_tree);
>  
>  static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
>  			  int opcode, struct fuse_open_out *outargp)
> @@ -171,6 +177,248 @@ static void fuse_link_write_file(struct file *file)
>  	spin_unlock(&fi->lock);
>  }
>  
> +static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> +{
> +	struct fuse_dax_mapping *dmap = NULL;
> +
> +	spin_lock(&fc->lock);
> +
> +	/* TODO: Add logic to try to free up memory if wait is allowed */
> +	if (fc->nr_free_ranges <= 0) {
> +		spin_unlock(&fc->lock);
> +		return NULL;
> +	}
> +
> +	WARN_ON(list_empty(&fc->free_ranges));
> +
> +	/* Take a free range */
> +	dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
> +					list);
> +	list_del_init(&dmap->list);
> +	fc->nr_free_ranges--;
> +	spin_unlock(&fc->lock);
> +	return dmap;
> +}
> +
> +/* This assumes fc->lock is held */
> +static void __dmap_add_to_free_pool(struct fuse_conn *fc,
> +				struct fuse_dax_mapping *dmap)
> +{
> +	list_add_tail(&dmap->list, &fc->free_ranges);
> +	fc->nr_free_ranges++;
> +}
> +
> +static void dmap_add_to_free_pool(struct fuse_conn *fc,
> +				struct fuse_dax_mapping *dmap)
> +{
> +	/* Return fuse_dax_mapping to free list */
> +	spin_lock(&fc->lock);
> +	__dmap_add_to_free_pool(fc, dmap);
> +	spin_unlock(&fc->lock);
> +}
> +
> +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> +				  struct fuse_dax_mapping *dmap, bool writable,
> +				  bool upgrade)
> +{
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_setupmapping_in inarg;
> +	FUSE_ARGS(args);
> +	ssize_t err;
> +
> +	WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> +	WARN_ON(fc->nr_free_ranges < 0);
> +
> +	/* Ask fuse daemon to setup mapping */
> +	memset(&inarg, 0, sizeof(inarg));
> +	inarg.foffset = offset;
> +	inarg.fh = -1;
> +	inarg.moffset = dmap->window_offset;
> +	inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> +	inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> +	if (writable)
> +		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> +	args.in.h.opcode = FUSE_SETUPMAPPING;
> +	args.in.h.nodeid = fi->nodeid;
> +	args.in.numargs = 1;
> +	args.in.args[0].size = sizeof(inarg);
> +	args.in.args[0].value = &inarg;
> +	err = fuse_simple_request(fc, &args);
> +	if (err < 0) {
> +		printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
> +				 __func__, dmap->window_offset, err);
> +		return err;
> +	}
> +
> +	pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx writable=%d"
> +		 " err=%zd\n", offset, writable, err);
> +
> +	dmap->writable = writable;
> +	if (!upgrade) {
> +		/* TODO: What locking is required here. For now,
> +		 * using fc->lock
> +		 */
> +		dmap->start = offset;
> +		dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> +		/* Protected by fi->i_dmap_sem */
> +		fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
> +		fi->nr_dmaps++;
> +	}
> +	return 0;
> +}
> +
> +static int
> +fuse_send_removemapping(struct inode *inode,
> +			struct fuse_removemapping_in *inargp,
> +			struct fuse_removemapping_one *remove_one)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	FUSE_ARGS(args);
> +
> +	args.in.h.opcode = FUSE_REMOVEMAPPING;
> +	args.in.h.nodeid = fi->nodeid;
> +	args.in.numargs = 2;
> +	args.in.args[0].size = sizeof(*inargp);
> +	args.in.args[0].value = inargp;
> +	args.in.args[1].size = inargp->count * sizeof(*remove_one);
> +	args.in.args[1].value = remove_one;
> +	return fuse_simple_request(fc, &args);
> +}
> +
> +static int dmap_removemapping_list(struct inode *inode, unsigned num,
> +				   struct list_head *to_remove)
> +{
> +	struct fuse_removemapping_one *remove_one, *ptr;
> +	struct fuse_removemapping_in inarg;
> +	struct fuse_dax_mapping *dmap;
> +	int ret, i = 0, nr_alloc;
> +
> +	nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
> +	remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
> +	if (!remove_one)
> +		return -ENOMEM;
> +
> +	ptr = remove_one;
> +	list_for_each_entry(dmap, to_remove, list) {
> +		ptr->moffset = dmap->window_offset;
> +		ptr->len = dmap->length;
> +		ptr++;
> +		i++;
> +		num--;
> +		if (i >= nr_alloc || num == 0) {
> +			memset(&inarg, 0, sizeof(inarg));
> +			inarg.count = i;
> +			ret = fuse_send_removemapping(inode, &inarg,
> +						      remove_one);
> +			if (ret)
> +				goto out;
> +			ptr = remove_one;
> +			i = 0;
> +		}
> +	}
> +out:
> +	kfree(remove_one);
> +	return ret;
> +}
> +
> +/*
> + * Cleanup dmap entry and add back to free list. This should be called with
> + * fc->lock held.
> + */
> +static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
> +					    struct fuse_dax_mapping *dmap)
> +{
> +	pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
> +		 "window_offset=0x%llx length=0x%llx\n", dmap->start,
> +		 dmap->end, dmap->window_offset, dmap->length);
> +	dmap->start = dmap->end = 0;
> +	__dmap_add_to_free_pool(fc, dmap);
> +}
> +
> +/*
> + * Free inode dmap entries whose range falls entirely inside [start, end].
> + * Does not take any locks. Caller must take care of any lock requirements.
> + * Lock ordering follows fuse_dax_free_one_mapping().
> + * inode->i_rwsem, fuse_inode->i_mmap_sem and fuse_inode->i_dmap_sem must be
> + * held exclusively, unless it is called from evict_inode() where no one else
> + * is accessing the inode.
> + */
> +static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> +				      loff_t start, loff_t end)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_dax_mapping *dmap, *n;
> +	int err, num = 0;
> +	LIST_HEAD(to_remove);
> +
> +	pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
> +
> +	/*
> +	 * Interval tree search matches intersecting entries. Adjust the range
> +	 * to avoid dropping partial valid entries.
> +	 */
> +	start = ALIGN(start, FUSE_DAX_MEM_RANGE_SZ);
> +	end = ALIGN_DOWN(end, FUSE_DAX_MEM_RANGE_SZ);
> +
> +	while (1) {
> +		dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, start,
> +							 end);
> +		if (!dmap)
> +			break;
> +		fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> +		num++;
> +		list_add(&dmap->list, &to_remove);
> +	}
> +
> +	/* Nothing to remove */
> +	if (list_empty(&to_remove))
> +		return;
> +
> +	WARN_ON(fi->nr_dmaps < num);
> +	fi->nr_dmaps -= num;
> +	/*
> +	 * During umount/shutdown, fuse connection is dropped first
> +	 * and evict_inode() is called later. That means any
> +	 * removemapping messages are going to fail. Send messages
> +	 * only if connection is up. Otherwise fuse daemon is
> +	 * responsible for cleaning up any leftover references and
> +	 * mappings.
> +	 */
> +	if (fc->connected) {
> +		err = dmap_removemapping_list(inode, num, &to_remove);
> +		if (err) {
> +			pr_warn("Failed to removemappings. start=0x%llx"
> +				" end=0x%llx\n", start, end);
> +		}
> +	}
> +	spin_lock(&fc->lock);
> +	list_for_each_entry_safe(dmap, n, &to_remove, list) {
> +		list_del_init(&dmap->list);
> +		dmap_reinit_add_to_free_pool(fc, dmap);
> +	}
> +	spin_unlock(&fc->lock);
> +}
> +
> +/*
> + * It is called from evict_inode() and by that time inode is going away. So
> + * this function does not take any locks like fi->i_dmap_sem for traversing
> + * that fuse inode interval tree. If that lock is taken then lock validator
> + * complains of deadlock situation w.r.t fs_reclaim lock.
> + */
> +void fuse_cleanup_inode_mappings(struct inode *inode)
> +{
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	/*
> +	 * fuse_evict_inode() has alredy called truncate_inode_pages_final()
> +	 * before we arrive here. So we should not have to worry about
> +	 * any pages/exception entries still associated with inode.
> +	 */
> +	inode_reclaim_dmap_range(fc, inode, 0, -1);
> +}
> +
>  void fuse_finish_open(struct inode *inode, struct file *file)
>  {
>  	struct fuse_file *ff = file->private_data;
> @@ -1481,32 +1729,364 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	return res;
>  }
>  
> +static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
>  static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct fuse_file *ff = file->private_data;
> +	struct inode *inode = file->f_mapping->host;
>  
>  	if (is_bad_inode(file_inode(file)))
>  		return -EIO;
>  
> -	if (!(ff->open_flags & FOPEN_DIRECT_IO))
> -		return fuse_cache_read_iter(iocb, to);
> -	else
> +	if (IS_DAX(inode))
> +		return fuse_dax_read_iter(iocb, to);
> +
> +	if (ff->open_flags & FOPEN_DIRECT_IO)
>  		return fuse_direct_read_iter(iocb, to);
> +
> +	return fuse_cache_read_iter(iocb, to);
>  }
>  
> +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
>  static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct fuse_file *ff = file->private_data;
> +	struct inode *inode = file->f_mapping->host;
>  
>  	if (is_bad_inode(file_inode(file)))
>  		return -EIO;
>  
> -	if (!(ff->open_flags & FOPEN_DIRECT_IO))
> -		return fuse_cache_write_iter(iocb, from);
> -	else
> +	if (IS_DAX(inode))
> +		return fuse_dax_write_iter(iocb, from);
> +
> +	if (ff->open_flags & FOPEN_DIRECT_IO)
>  		return fuse_direct_write_iter(iocb, from);
> +
> +	return fuse_cache_write_iter(iocb, from);
> +}
> +
> +static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
> +{
> +	iomap->addr = IOMAP_NULL_ADDR;
> +	iomap->length = length;
> +	iomap->type = IOMAP_HOLE;
> +}
> +
> +static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
> +			struct iomap *iomap, struct fuse_dax_mapping *dmap,
> +			unsigned flags)
> +{
> +	loff_t offset, len;
> +	loff_t i_size = i_size_read(inode);
> +
> +	offset = pos - dmap->start;
> +	len = min(length, dmap->length - offset);
> +
> +	/* If length is beyond end of file, truncate further */
> +	if (pos + len > i_size)
> +		len = i_size - pos;
> +
> +	if (len > 0) {
> +		iomap->addr = dmap->window_offset + offset;
> +		iomap->length = len;
> +		if (flags & IOMAP_FAULT)
> +			iomap->length = ALIGN(len, PAGE_SIZE);
> +		iomap->type = IOMAP_MAPPED;
> +		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> +				" length 0x%llx\n", __func__, iomap->addr,
> +				iomap->offset, iomap->length);
> +	} else {
> +		/* Mapping beyond end of file is hole */
> +		fuse_fill_iomap_hole(iomap, length);
> +		pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> +				"length 0x%llx\n", __func__, iomap->addr,
> +				iomap->offset, iomap->length);
> +	}
> +}
> +
> +static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> +					 loff_t length, unsigned flags,
> +					 struct iomap *iomap)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
> +	int ret;
> +	bool writable = flags & IOMAP_WRITE;
> +
> +	alloc_dmap = alloc_dax_mapping(fc);
> +	if (!alloc_dmap)
> +		return -EBUSY;
> +
> +	/*
> +	 * Take write lock so that only one caller can try to setup mapping
> +	 * and other waits.
> +	 */
> +	down_write(&fi->i_dmap_sem);
> +	/*
> +	 * We dropped lock. Check again if somebody else setup
> +	 * mapping already.
> +	 */
> +	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
> +						pos);
> +	if (dmap) {
> +		fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +		dmap_add_to_free_pool(fc, alloc_dmap);
> +		up_write(&fi->i_dmap_sem);
> +		return 0;
> +	}
> +
> +	/* Setup one mapping */
> +	ret = fuse_setup_one_mapping(inode,
> +				     ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> +				     alloc_dmap, writable, false);
> +	if (ret < 0) {
> +		printk("fuse_setup_one_mapping() failed. err=%d"
> +			" pos=0x%llx, writable=%d\n", ret, pos, writable);
> +		dmap_add_to_free_pool(fc, alloc_dmap);
> +		up_write(&fi->i_dmap_sem);
> +		return ret;
> +	}
> +	fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
> +	up_write(&fi->i_dmap_sem);
> +	return 0;
> +}
> +
> +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> +					 loff_t length, unsigned flags,
> +					 struct iomap *iomap)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_dax_mapping *dmap;
> +	int ret;
> +
> +	/*
> +	 * Take exclusive lock so that only one caller can try to setup
> +	 * mapping and others wait.
> +	 */
> +	down_write(&fi->i_dmap_sem);
> +	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> +
> +	/* We are holding either inode lock or i_mmap_sem, and that should
> +	 * ensure that dmap can't reclaimed or truncated and it should still
> +	 * be there in tree despite the fact we dropped and re-acquired the
> +	 * lock.
> +	 */
> +	ret = -EIO;
> +	if (WARN_ON(!dmap))
> +		goto out_err;
> +
> +	/* Maybe another thread already upgraded mapping while we were not
> +	 * holding lock.
> +	 */
> +	if (dmap->writable)
> +		goto out_fill_iomap;

@ret needs to be reset here.
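
To illustrate, here is a minimal userspace sketch of the control flow, not
kernel code. struct mapping, setup_mapping(), upgrade_buggy() and
upgrade_fixed() are hypothetical stand-ins for the names in the patch. Since
ret is preloaded with -EIO for the WARN_ON() path, the early jump to the fill
label returns -EIO even though the mapping is usable. Resetting ret to 0
before the jump is one possible fix.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fuse_dax_mapping. */
struct mapping {
	bool writable;
};

/* Hypothetical stand-in for fuse_setup_one_mapping(); always succeeds. */
static int setup_mapping(struct mapping *m)
{
	m->writable = true;
	return 0;
}

/* Mirrors the flow in the patch: ret is still -EIO on the early jump. */
int upgrade_buggy(struct mapping *m)
{
	int ret = -EIO;

	if (!m)
		goto out_err;
	if (m->writable)
		goto out_fill;	/* already upgraded, but ret == -EIO */
	ret = setup_mapping(m);
	if (ret < 0)
		goto out_err;
out_fill:
	/* fuse_fill_iomap() would be called here */
out_err:
	return ret;
}

/* Same flow with ret reset before jumping to the fill label. */
int upgrade_fixed(struct mapping *m)
{
	int ret = -EIO;

	if (!m)
		goto out_err;
	if (m->writable) {
		ret = 0;	/* mapping is usable, report success */
		goto out_fill;
	}
	ret = setup_mapping(m);
	if (ret < 0)
		goto out_err;
out_fill:
	/* fuse_fill_iomap() would be called here */
out_err:
	return ret;
}
```

With an already writable mapping, upgrade_buggy() returns -EIO while
upgrade_fixed() returns 0. Both still return -EIO when no mapping is found.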

thanks,
-liubo

> +
> +	ret = fuse_setup_one_mapping(inode,
> +				     ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> +				     dmap, true, true);
> +	if (ret < 0) {
> +		printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
> +		       ret, pos);
> +		goto out_err;
> +	}
> +
> +out_fill_iomap:
> +	fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +out_err:
> +	up_write(&fi->i_dmap_sem);
> +	return ret;
> +}
> +
> +/* This is just for DAX and the mapping is ephemeral, do not use it for other
> + * purposes since there is no block device with a permanent mapping.
> + */
> +static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> +			    unsigned flags, struct iomap *iomap)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	struct fuse_dax_mapping *dmap;
> +	bool writable = flags & IOMAP_WRITE;
> +
> +	/* We don't support FIEMAP */
> +	BUG_ON(flags & IOMAP_REPORT);
> +
> +	pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
> +			pos, length);
> +
> +	/*
> +	 * Writes beyond end of file are not handled using dax path. Instead
> +	 * a fuse write message is sent to daemon
> +	 */
> +	if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
> +		return -EIO;
> +
> +	iomap->offset = pos;
> +	iomap->flags = 0;
> +	iomap->bdev = NULL;
> +	iomap->dax_dev = fc->dax_dev;
> +
> +	/*
> +	 * Both read/write and mmap path can race here. So we need something
> +	 * to make sure if we are setting up mapping, then other path waits
> +	 *
> +	 * For now, use a semaphore for this. It probably needs to be
> +	 * optimized later.
> +	 */
> +	down_read(&fi->i_dmap_sem);
> +	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> +
> +	if (dmap) {
> +		if (writable && !dmap->writable) {
> +			/* Upgrade read-only mapping to read-write. This will
> +			 * require exclusive i_dmap_sem lock as we don't want
> +			 * two threads to be trying to this simultaneously
> +			 * for same dmap. So drop shared lock and acquire
> +			 * exclusive lock.
> +			 */
> +			up_read(&fi->i_dmap_sem);
> +			pr_debug("%s: Upgrading mapping at offset 0x%llx"
> +				 " length 0x%llx\n", __func__, pos, length);
> +			return iomap_begin_upgrade_mapping(inode, pos, length,
> +							   flags, iomap);
> +		} else {
> +			fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +			up_read(&fi->i_dmap_sem);
> +			return 0;
> +		}
> +	} else {
> +		up_read(&fi->i_dmap_sem);
> +		pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
> +				__func__, pos, length);
> +		if (pos >= i_size_read(inode))
> +			goto iomap_hole;
> +
> +		return iomap_begin_setup_new_mapping(inode, pos, length, flags,
> +						     iomap);
> +	}
> +
> +	/*
> +	 * If read beyond end of file happnes, fs code seems to return
> +	 * it as hole
> +	 */
> +iomap_hole:
> +	fuse_fill_iomap_hole(iomap, length);
> +	pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
> +	return 0;
> +}
> +
> +static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> +			  ssize_t written, unsigned flags,
> +			  struct iomap *iomap)
> +{
> +	/* DAX writes beyond end-of-file aren't handled using iomap, so the
> +	 * file size is unchanged and there is nothing to do here.
> +	 */
> +	return 0;
> +}
> +
> +static const struct iomap_ops fuse_iomap_ops = {
> +	.iomap_begin = fuse_iomap_begin,
> +	.iomap_end = fuse_iomap_end,
> +};
> +
> +static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	ssize_t ret;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock_shared(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock_shared(inode);
> +	}
> +
> +	ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
> +	inode_unlock_shared(inode);
> +
> +	/* TODO file_accessed(iocb->f_filp) */
> +
> +	return ret;
> +}
> +
> +static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	return (iov_iter_rw(from) == WRITE &&
> +		((iocb->ki_pos) >= i_size_read(inode)));
> +}
> +
> +static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
> +	ssize_t ret;
> +
> +	ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
> +	if (ret < 0)
> +		return ret;
> +
> +	fuse_invalidate_attr(inode);
> +	fuse_write_update_size(inode, iocb->ki_pos);
> +	return ret;
> +}
> +
> +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	ssize_t ret, count;
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!inode_trylock(inode))
> +			return -EAGAIN;
> +	} else {
> +		inode_lock(inode);
> +	}
> +
> +	ret = generic_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	ret = file_remove_privs(iocb->ki_filp);
> +	if (ret)
> +		goto out;
> +	/* TODO file_update_time() but we don't want metadata I/O */
> +
> +	/* Do not use dax for file extending writes as its an mmap and
> +	 * trying to write beyong end of existing page will generate
> +	 * SIGBUS.
> +	 */
> +	if (file_extending_write(iocb, from)) {
> +		ret = fuse_dax_direct_write(iocb, from);
> +		goto out;
> +	}
> +
> +	ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
> +	if (ret < 0)
> +		goto out;
> +
> +	/*
> +	 * If part of the write was file extending, fuse dax path will not
> +	 * take care of that. Do direct write instead.
> +	 */
> +	if (iov_iter_count(from) && file_extending_write(iocb, from)) {
> +		count = fuse_dax_direct_write(iocb, from);
> +		if (count < 0)
> +			goto out;
> +		ret += count;
> +	}
> +
> +out:
> +	inode_unlock(inode);
> +
> +	if (ret > 0)
> +		ret = generic_write_sync(iocb, ret);
> +	return ret;
>  }
>  
>  static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
> @@ -2185,6 +2765,11 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
>  
>  }
>  
> +static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	return -EINVAL; /* TODO */
> +}
> +
>  static int convert_fuse_file_lock(struct fuse_conn *fc,
>  				  const struct fuse_file_lock *ffl,
>  				  struct file_lock *fl)
> @@ -3266,6 +3851,7 @@ static const struct address_space_operations fuse_file_aops  = {
>  void fuse_init_file_inode(struct inode *inode)
>  {
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
>  
>  	inode->i_fop = &fuse_file_operations;
>  	inode->i_data.a_ops = &fuse_file_aops;
> @@ -3275,4 +3861,9 @@ void fuse_init_file_inode(struct inode *inode)
>  	fi->writectr = 0;
>  	init_waitqueue_head(&fi->page_waitq);
>  	INIT_LIST_HEAD(&fi->writepages);
> +	fi->dmap_tree = RB_ROOT_CACHED;
> +
> +	if (fc->dax_dev) {
> +		inode->i_flags |= S_DAX;
> +	}
>  }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index b020a4071f80..37b31c5435ff 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -104,16 +104,29 @@ struct fuse_forget_link {
>  	struct fuse_forget_link *next;
>  };
>  
> +#define START(node) ((node)->start)
> +#define LAST(node) ((node)->end)
> +
>  /** Translation information for file offsets to DAX window offsets */
>  struct fuse_dax_mapping {
>  	/* Will connect in fc->free_ranges to keep track of free memory */
>  	struct list_head list;
>  
> +	/* For interval tree in file/inode */
> +	struct rb_node rb;
> +	/** Start Position in file */
> +	__u64 start;
> +	/** End Position in file */
> +	__u64 end;
> +	__u64 __subtree_last;
>  	/** Position in DAX window */
>  	u64 window_offset;
>  
>  	/** Length of mapping, in bytes */
>  	loff_t length;
> +
> +	/* Is this mapping read-only or read-write */
> +	bool writable;
>  };
>  
>  /** FUSE inode */
> @@ -201,6 +214,15 @@ struct fuse_inode {
>  
>  	/** Lock to protect write related fields */
>  	spinlock_t lock;
> +
> +	/*
> +	 * Semaphore to protect modifications to dmap_tree
> +	 */
> +	struct rw_semaphore i_dmap_sem;
> +
> +	/** Sorted rb tree of struct fuse_dax_mapping elements */
> +	struct rb_root_cached dmap_tree;
> +	unsigned long nr_dmaps;
>  };
>  
>  /** FUSE inode state bits */
> @@ -1242,5 +1264,6 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
>   */
>  u64 fuse_get_unique(struct fuse_iqueue *fiq);
>  void fuse_free_conn(struct fuse_conn *fc);
> +void fuse_cleanup_inode_mappings(struct inode *inode);
>  
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index d5d134a01117..7e0ed5f3f7e6 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -81,7 +81,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>  	fi->attr_version = 0;
>  	fi->orig_ino = 0;
>  	fi->state = 0;
> +	fi->nr_dmaps = 0;
>  	mutex_init(&fi->mutex);
> +	init_rwsem(&fi->i_dmap_sem);
>  	spin_lock_init(&fi->lock);
>  	fi->forget = fuse_alloc_forget();
>  	if (!fi->forget) {
> @@ -109,6 +111,10 @@ static void fuse_evict_inode(struct inode *inode)
>  	clear_inode(inode);
>  	if (inode->i_sb->s_flags & SB_ACTIVE) {
>  		struct fuse_conn *fc = get_fuse_conn(inode);
> +		if (IS_DAX(inode)) {
> +			fuse_cleanup_inode_mappings(inode);
> +			WARN_ON(fi->nr_dmaps);
> +		}
>  		fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
>  		fi->forget = NULL;
>  	}
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 7c2ad3d418df..ac23f57d8fd6 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -854,6 +854,7 @@ struct fuse_copy_file_range_in {
>  
>  #define FUSE_SETUPMAPPING_ENTRIES 8
>  #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> +#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
>  struct fuse_setupmapping_in {
>  	/* An already open handle */
>  	uint64_t	fh;
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/19] fuse, dax: Implement dax read/write operations
  2019-08-21 19:49   ` Liu Bo
@ 2019-08-22 12:59     ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-22 12:59 UTC (permalink / raw)
  To: Liu Bo
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Miklos Szeredi, Peng Tao

On Wed, Aug 21, 2019 at 12:49:34PM -0700, Liu Bo wrote:

[..]
> > +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> > +					 loff_t length, unsigned flags,
> > +					 struct iomap *iomap)
> > +{
> > +	struct fuse_inode *fi = get_fuse_inode(inode);
> > +	struct fuse_dax_mapping *dmap;
> > +	int ret;
> > +
> > +	/*
> > +	 * Take exclusive lock so that only one caller can try to setup
> > +	 * mapping and others wait.
> > +	 */
> > +	down_write(&fi->i_dmap_sem);
> > +	dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> > +
> > +	/* We are holding either inode lock or i_mmap_sem, and that should
> > +	 * ensure that dmap can't reclaimed or truncated and it should still
> > +	 * be there in tree despite the fact we dropped and re-acquired the
> > +	 * lock.
> > +	 */
> > +	ret = -EIO;
> > +	if (WARN_ON(!dmap))
> > +		goto out_err;
> > +
> > +	/* Maybe another thread already upgraded mapping while we were not
> > +	 * holding lock.
> > +	 */
> > +	if (dmap->writable)
> > +		goto out_fill_iomap;
> 
> @ret needs to be reset here.
> 

Good catch. Will fix it.

Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Virtio-fs] [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-21 17:57 ` [PATCH 04/19] virtio: Implement get_shm_region for PCI transport Vivek Goyal
@ 2019-08-26  1:43   ` piaojun
  2019-08-26 13:06     ` Vivek Goyal
  2019-08-27  8:34   ` Cornelia Huck
  1 sibling, 1 reply; 77+ messages in thread
From: piaojun @ 2019-08-26  1:43 UTC (permalink / raw)
  To: Vivek Goyal, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: kbuild test robot, kvm, miklos, virtio-fs, Sebastien Boeuf



On 2019/8/22 1:57, Vivek Goyal wrote:
> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> On PCI the shm regions are found using capability entries;
> find a region by searching for the capability.
> 
> Cc: kvm@vger.kernel.org
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: kbuild test robot <lkp@intel.com>
> ---
>  drivers/virtio/virtio_pci_modern.c | 108 +++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_pci.h    |  11 ++-
>  2 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 7abcc50838b8..1cdedd93f42a 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -443,6 +443,112 @@ static void del_vq(struct virtio_pci_vq_info *info)
>  	vring_del_virtqueue(vq);
>  }
>  
> +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> +                                   u8 required_id,
> +                                   u8 *bar, u64 *offset, u64 *len)
> +{
> +	int pos;
> +
> +        for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> +             pos > 0;
> +             pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> +		u8 type, cap_len, id;
> +                u32 tmp32;
> +                u64 res_offset, res_length;
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         cfg_type),
> +                                     &type);
> +                if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
> +                        continue;
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         cap_len),
> +                                     &cap_len);
> +		if (cap_len != sizeof(struct virtio_pci_cap64)) {
> +		        printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
> +                               __func__, pos, cap_len);
> +                        continue;
> +                }
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         id),
> +                                     &id);
> +                if (id != required_id)
> +                        continue;
> +
> +                /* Type, and ID match, looks good */
> +                pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         bar),
> +                                     bar);
> +
> +                /* Read the lower 32bit of length and offset */
> +                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
> +                                      &tmp32);
> +                res_offset = tmp32;
> +                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
> +                                      &tmp32);
> +                res_length = tmp32;
> +
> +                /* and now the top half */
> +                pci_read_config_dword(dev,
> +                                      pos + offsetof(struct virtio_pci_cap64,
> +                                                     offset_hi),
> +                                      &tmp32);
> +                res_offset |= ((u64)tmp32) << 32;
> +                pci_read_config_dword(dev,
> +                                      pos + offsetof(struct virtio_pci_cap64,
> +                                                     length_hi),
> +                                      &tmp32);
> +                res_length |= ((u64)tmp32) << 32;
> +
> +                *offset = res_offset;
> +                *len = res_length;
> +
> +                return pos;
> +        }
> +        return 0;
> +}
> +
> +static bool vp_get_shm_region(struct virtio_device *vdev,
> +			      struct virtio_shm_region *region, u8 id)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> +	u8 bar;
> +	u64 offset, len;
> +	phys_addr_t phys_addr;
> +	size_t bar_len;
> +	char *bar_name;

'char *bar_name' should be cleaned up to avoid compiling warning. And I
wonder if you mix tab and blankspace for code indent? Or it's just my
email display problem?

Thanks,
Jun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-21 17:57 ` [PATCH 01/19] dax: remove block device dependencies Vivek Goyal
@ 2019-08-26 11:51   ` Christoph Hellwig
  2019-08-27 16:38     ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Hellwig @ 2019-08-26 11:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Dan Williams

On Wed, Aug 21, 2019 at 01:57:02PM -0400, Vivek Goyal wrote:
> From: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Although struct dax_device itself is not tied to a block device, some
> DAX code assumes there is a block device.  Make block devices optional
> by allowing bdev to be NULL in commonly used DAX APIs.
> 
> When there is no block device:
>  * Skip the partition offset calculation in bdev_dax_pgoff()
>  * Skip the blkdev_issue_zeroout() optimization
> 
> Note that more block device assumptions remain but I haven't reach those
> code paths yet.

I think this should be split into two patches.  For bdev_dax_pgoff
I'd much rather store the partition offset, if there is one, in the
daxdev somehow so that we can get rid of the block device entirely.

Similarly for dax_range_is_aligned I'd rather have a pure dax way
to offload zeroing rather than this bdev hack.

In the long run I'd really like to make the bdev vs daxdev in iomap a
union instead of having to carry both around.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-21 17:57 ` [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
@ 2019-08-26 11:53   ` Christoph Hellwig
  2019-08-26 20:33     ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Hellwig @ 2019-08-26 11:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Dan Williams

On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> is searched from that bdev name.
> 
> virtio-fs does not have a bdev. So pass in dax_dev also to
> dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> used otherwise dax_dev is searched using bdev.

Please just pass in only the dax_device and get rid of the block device.
The callers should have one at hand easily, e.g. for XFS just call
xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.


* Re: [Virtio-fs] [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-26  1:43   ` [Virtio-fs] " piaojun
@ 2019-08-26 13:06     ` Vivek Goyal
  2019-08-27  9:41       ` piaojun
  0 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-26 13:06 UTC (permalink / raw)
  To: piaojun
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, kbuild test robot,
	kvm, miklos, virtio-fs, Sebastien Boeuf

On Mon, Aug 26, 2019 at 09:43:08AM +0800, piaojun wrote:

[..]
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > +			      struct virtio_shm_region *region, u8 id)
> > +{
> > +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> > +	u8 bar;
> > +	u64 offset, len;
> > +	phys_addr_t phys_addr;
> > +	size_t bar_len;
> > +	char *bar_name;
> 
> 'char *bar_name' is unused and should be removed to avoid a compiler
> warning. Also, are tabs and spaces mixed in the code indentation, or is
> that just a display problem in my mail client?

Will get rid of the now-unused bar_name.

Generally git flags if there are tab/space issues. I did not see any. So
if you see something, point it out and I will fix it.

Vivek


* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-26 11:53   ` Christoph Hellwig
@ 2019-08-26 20:33     ` Vivek Goyal
  2019-08-26 20:58       ` Vivek Goyal
  2019-08-27 13:45       ` Jan Kara
  0 siblings, 2 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-26 20:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Dan Williams

On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > is searched from that bdev name.
> > 
> > virtio-fs does not have a bdev. So pass in dax_dev also to
> > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > used otherwise dax_dev is searched using bdev.
> 
> Please just pass in only the dax_device and get rid of the block device.
> The callers should have one at hand easily, e.g. for XFS just call
> xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.

Sure. Here is the updated patch.

This patch can probably go upstream independently. If you are fine with
the patch, I can post it separately for inclusion.


Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()

As of now dax_writeback_mapping_range() takes "struct block_device" as a
parameter and dax_dev is searched from bdev name. This also involves taking
a fresh reference on dax_dev and putting that reference at the end of
function.

We are developing a new filesystem virtio-fs and using dax to access host
page cache directly. But there is no block device. IOW, we want to make
use of dax but want to get rid of this assumption that there is always
a block device associated with dax_dev.

So pass in "struct dax_device" as parameter instead of bdev.

ext2/ext4/xfs are current users and they already have a reference on
dax_device. So there is no need to take reference and drop reference to
dax_device on each call of this function.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            |    8 +-------
 fs/ext2/inode.c     |    5 +++--
 fs/ext4/inode.c     |    2 +-
 fs/xfs/xfs_aops.c   |    2 +-
 include/linux/dax.h |    2 +-
 5 files changed, 7 insertions(+), 12 deletions(-)

Index: rhvgoyal-linux-fuse/fs/dax.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/dax.c	2019-08-26 11:20:36.545009968 -0400
+++ rhvgoyal-linux-fuse/fs/dax.c	2019-08-26 11:24:43.973009968 -0400
@@ -936,12 +936,11 @@ static int dax_writeback_one(struct xa_s
  * on persistent storage prior to completion of the operation.
  */
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
 	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
 	struct inode *inode = mapping->host;
 	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-	struct dax_device *dax_dev;
 	void *entry;
 	int ret = 0;
 	unsigned int scanned = 0;
@@ -952,10 +951,6 @@ int dax_writeback_mapping_range(struct a
 	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
-	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-	if (!dax_dev)
-		return -EIO;
-
 	trace_dax_writeback_range(inode, xas.xa_index, end_index);
 
 	tag_pages_for_writeback(mapping, xas.xa_index, end_index);
@@ -976,7 +971,6 @@ int dax_writeback_mapping_range(struct a
 		xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
-	put_dax(dax_dev);
 	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
 	return ret;
 }
Index: rhvgoyal-linux-fuse/include/linux/dax.h
===================================================================
--- rhvgoyal-linux-fuse.orig/include/linux/dax.h	2019-08-26 11:20:36.545009968 -0400
+++ rhvgoyal-linux-fuse/include/linux/dax.h	2019-08-26 11:26:08.384009968 -0400
@@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
+		struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/xfs/xfs_aops.c	2019-08-26 11:20:36.545009968 -0400
+++ rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c	2019-08-26 11:34:51.085009968 -0400
@@ -1120,7 +1120,7 @@ xfs_dax_writepages(
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return dax_writeback_mapping_range(mapping,
-			xfs_find_bdev_for_inode(mapping->host), wbc);
+			xfs_find_daxdev_for_inode(mapping->host), wbc);
 }
 
 STATIC int
Index: rhvgoyal-linux-fuse/fs/ext4/inode.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/ext4/inode.c	2019-08-26 11:20:36.545009968 -0400
+++ rhvgoyal-linux-fuse/fs/ext4/inode.c	2019-08-26 11:39:56.828009968 -0400
@@ -2992,7 +2992,7 @@ static int ext4_dax_writepages(struct ad
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 	trace_ext4_writepages_result(inode, wbc, ret,
 				     nr_to_write - wbc->nr_to_write);
 	percpu_up_read(&sbi->s_journal_flag_rwsem);
Index: rhvgoyal-linux-fuse/fs/ext2/inode.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/ext2/inode.c	2019-08-26 11:20:36.545009968 -0400
+++ rhvgoyal-linux-fuse/fs/ext2/inode.c	2019-08-26 11:43:04.842009968 -0400
@@ -957,8 +957,9 @@ ext2_writepages(struct address_space *ma
 static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-	return dax_writeback_mapping_range(mapping,
-			mapping->host->i_sb->s_bdev, wbc);
+	struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
+
+	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 }
 
 const struct address_space_operations ext2_aops = {


* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-26 20:33     ` Vivek Goyal
@ 2019-08-26 20:58       ` Vivek Goyal
  2019-08-26 21:33         ` Dan Williams
                           ` (2 more replies)
  2019-08-27 13:45       ` Jan Kara
  1 sibling, 3 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-26 20:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Dan Williams

On Mon, Aug 26, 2019 at 04:33:26PM -0400, Vivek Goyal wrote:
> On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> > On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > > is searched from that bdev name.
> > > 
> > > virtio-fs does not have a bdev. So pass in dax_dev also to
> > > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > > used otherwise dax_dev is searched using bdev.
> > 
> > Please just pass in only the dax_device and get rid of the block device.
> > The callers should have one at hand easily, e.g. for XFS just call
> > xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.
> 
> Sure. Here is the updated patch.
> 
> This patch can probably go upstream independently. If you are fine with
> the patch, I can post it separately for inclusion.

I forgot to update the function declaration for the !CONFIG_FS_DAX case.
Here is the updated patch.

Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()

As of now dax_writeback_mapping_range() takes "struct block_device" as a
parameter and dax_dev is searched from bdev name. This also involves taking
a fresh reference on dax_dev and putting that reference at the end of
function.

We are developing a new filesystem virtio-fs and using dax to access host
page cache directly. But there is no block device. IOW, we want to make
use of dax but want to get rid of this assumption that there is always
a block device associated with dax_dev.

So pass in "struct dax_device" as parameter instead of bdev.

ext2/ext4/xfs are current users and they already have a reference on
dax_device. So there is no need to take reference and drop reference to
dax_device on each call of this function.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            |    8 +-------
 fs/ext2/inode.c     |    5 +++--
 fs/ext4/inode.c     |    2 +-
 fs/xfs/xfs_aops.c   |    2 +-
 include/linux/dax.h |    4 ++--
 5 files changed, 8 insertions(+), 13 deletions(-)

Index: rhvgoyal-linux-fuse/fs/dax.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/dax.c	2019-08-26 16:45:26.093710196 -0400
+++ rhvgoyal-linux-fuse/fs/dax.c	2019-08-26 16:45:29.462710196 -0400
@@ -936,12 +936,11 @@ static int dax_writeback_one(struct xa_s
  * on persistent storage prior to completion of the operation.
  */
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
 	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
 	struct inode *inode = mapping->host;
 	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-	struct dax_device *dax_dev;
 	void *entry;
 	int ret = 0;
 	unsigned int scanned = 0;
@@ -952,10 +951,6 @@ int dax_writeback_mapping_range(struct a
 	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
-	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-	if (!dax_dev)
-		return -EIO;
-
 	trace_dax_writeback_range(inode, xas.xa_index, end_index);
 
 	tag_pages_for_writeback(mapping, xas.xa_index, end_index);
@@ -976,7 +971,6 @@ int dax_writeback_mapping_range(struct a
 		xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
-	put_dax(dax_dev);
 	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
 	return ret;
 }
Index: rhvgoyal-linux-fuse/include/linux/dax.h
===================================================================
--- rhvgoyal-linux-fuse.orig/include/linux/dax.h	2019-08-26 16:45:26.094710196 -0400
+++ rhvgoyal-linux-fuse/include/linux/dax.h	2019-08-26 16:46:08.101710196 -0400
@@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
+		struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
@@ -180,7 +180,7 @@ static inline struct page *dax_layout_bu
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
 	return -EOPNOTSUPP;
 }
Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/xfs/xfs_aops.c	2019-08-26 16:45:26.094710196 -0400
+++ rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c	2019-08-26 16:45:29.471710196 -0400
@@ -1120,7 +1120,7 @@ xfs_dax_writepages(
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return dax_writeback_mapping_range(mapping,
-			xfs_find_bdev_for_inode(mapping->host), wbc);
+			xfs_find_daxdev_for_inode(mapping->host), wbc);
 }
 
 STATIC int
Index: rhvgoyal-linux-fuse/fs/ext4/inode.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/ext4/inode.c	2019-08-26 16:45:26.093710196 -0400
+++ rhvgoyal-linux-fuse/fs/ext4/inode.c	2019-08-26 16:45:29.475710196 -0400
@@ -2992,7 +2992,7 @@ static int ext4_dax_writepages(struct ad
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 	trace_ext4_writepages_result(inode, wbc, ret,
 				     nr_to_write - wbc->nr_to_write);
 	percpu_up_read(&sbi->s_journal_flag_rwsem);
Index: rhvgoyal-linux-fuse/fs/ext2/inode.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/ext2/inode.c	2019-08-26 16:45:26.093710196 -0400
+++ rhvgoyal-linux-fuse/fs/ext2/inode.c	2019-08-26 16:45:29.477710196 -0400
@@ -957,8 +957,9 @@ ext2_writepages(struct address_space *ma
 static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-	return dax_writeback_mapping_range(mapping,
-			mapping->host->i_sb->s_bdev, wbc);
+	struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
+
+	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 }
 
 const struct address_space_operations ext2_aops = {


* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-26 20:58       ` Vivek Goyal
@ 2019-08-26 21:33         ` Dan Williams
  2019-08-28  6:58         ` Christoph Hellwig
  2020-01-03 14:12         ` Vivek Goyal
  2 siblings, 0 replies; 77+ messages in thread
From: Dan Williams @ 2019-08-26 21:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, linux-fsdevel, Linux Kernel Mailing List,
	linux-nvdimm, virtio-fs, Miklos Szeredi, Stefan Hajnoczi,
	Dr. David Alan Gilbert, Jan Kara

[ add Jan ]

On Mon, Aug 26, 2019 at 1:58 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Aug 26, 2019 at 04:33:26PM -0400, Vivek Goyal wrote:
> > On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> > > On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > > > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > > > is searched from that bdev name.
> > > >
> > > > virtio-fs does not have a bdev. So pass in dax_dev also to
> > > > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > > > used otherwise dax_dev is searched using bdev.
> > >
> > > Please just pass in only the dax_device and get rid of the block device.
> > > The callers should have one at hand easily, e.g. for XFS just call
> > > xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.
> >
> > Sure. Here is the updated patch.
> >
> > This patch can probably go upstream independently. If you are fine with
> > the patch, I can post it separately for inclusion.
>
> Forgot to update function declaration in case of !CONFIG_FS_DAX. Here is
> the updated patch.
>
> Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
>
> As of now dax_writeback_mapping_range() takes "struct block_device" as a
> parameter and dax_dev is searched from bdev name. This also involves taking
> a fresh reference on dax_dev and putting that reference at the end of
> function.
>
> We are developing a new filesystem virtio-fs and using dax to access host
> page cache directly. But there is no block device. IOW, we want to make
> use of dax but want to get rid of this assumption that there is always
> a block device associated with dax_dev.
>
> So pass in "struct dax_device" as parameter instead of bdev.
>
> ext2/ext4/xfs are current users and they already have a reference on
> dax_device. So there is no need to take reference and drop reference to
> dax_device on each call of this function.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/dax.c            |    8 +-------
>  fs/ext2/inode.c     |    5 +++--
>  fs/ext4/inode.c     |    2 +-
>  fs/xfs/xfs_aops.c   |    2 +-
>  include/linux/dax.h |    4 ++--

Looks good to me. Would be nice to get some ext4 and xfs acks then
I'll take it through the dax tree for v5.4.

>  5 files changed, 8 insertions(+), 13 deletions(-)
>
> Index: rhvgoyal-linux-fuse/fs/dax.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/dax.c   2019-08-26 16:45:26.093710196 -0400
> +++ rhvgoyal-linux-fuse/fs/dax.c        2019-08-26 16:45:29.462710196 -0400
> @@ -936,12 +936,11 @@ static int dax_writeback_one(struct xa_s
>   * on persistent storage prior to completion of the operation.
>   */
>  int dax_writeback_mapping_range(struct address_space *mapping,
> -               struct block_device *bdev, struct writeback_control *wbc)
> +               struct dax_device *dax_dev, struct writeback_control *wbc)
>  {
>         XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
>         struct inode *inode = mapping->host;
>         pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
> -       struct dax_device *dax_dev;
>         void *entry;
>         int ret = 0;
>         unsigned int scanned = 0;
> @@ -952,10 +951,6 @@ int dax_writeback_mapping_range(struct a
>         if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
>                 return 0;
>
> -       dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> -       if (!dax_dev)
> -               return -EIO;
> -
>         trace_dax_writeback_range(inode, xas.xa_index, end_index);
>
>         tag_pages_for_writeback(mapping, xas.xa_index, end_index);
> @@ -976,7 +971,6 @@ int dax_writeback_mapping_range(struct a
>                 xas_lock_irq(&xas);
>         }
>         xas_unlock_irq(&xas);
> -       put_dax(dax_dev);
>         trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
>         return ret;
>  }
> Index: rhvgoyal-linux-fuse/include/linux/dax.h
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/include/linux/dax.h        2019-08-26 16:45:26.094710196 -0400
> +++ rhvgoyal-linux-fuse/include/linux/dax.h     2019-08-26 16:46:08.101710196 -0400
> @@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
>
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
> -               struct block_device *bdev, struct writeback_control *wbc);
> +               struct dax_device *dax_dev, struct writeback_control *wbc);
>
>  struct page *dax_layout_busy_page(struct address_space *mapping);
>  dax_entry_t dax_lock_page(struct page *page);
> @@ -180,7 +180,7 @@ static inline struct page *dax_layout_bu
>  }
>
>  static inline int dax_writeback_mapping_range(struct address_space *mapping,
> -               struct block_device *bdev, struct writeback_control *wbc)
> +               struct dax_device *dax_dev, struct writeback_control *wbc)
>  {
>         return -EOPNOTSUPP;
>  }
> Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/xfs/xfs_aops.c  2019-08-26 16:45:26.094710196 -0400
> +++ rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c       2019-08-26 16:45:29.471710196 -0400
> @@ -1120,7 +1120,7 @@ xfs_dax_writepages(
>  {
>         xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>         return dax_writeback_mapping_range(mapping,
> -                       xfs_find_bdev_for_inode(mapping->host), wbc);
> +                       xfs_find_daxdev_for_inode(mapping->host), wbc);
>  }
>
>  STATIC int
> Index: rhvgoyal-linux-fuse/fs/ext4/inode.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/ext4/inode.c    2019-08-26 16:45:26.093710196 -0400
> +++ rhvgoyal-linux-fuse/fs/ext4/inode.c 2019-08-26 16:45:29.475710196 -0400
> @@ -2992,7 +2992,7 @@ static int ext4_dax_writepages(struct ad
>         percpu_down_read(&sbi->s_journal_flag_rwsem);
>         trace_ext4_writepages(inode, wbc);
>
> -       ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
> +       ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>         trace_ext4_writepages_result(inode, wbc, ret,
>                                      nr_to_write - wbc->nr_to_write);
>         percpu_up_read(&sbi->s_journal_flag_rwsem);
> Index: rhvgoyal-linux-fuse/fs/ext2/inode.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/ext2/inode.c    2019-08-26 16:45:26.093710196 -0400
> +++ rhvgoyal-linux-fuse/fs/ext2/inode.c 2019-08-26 16:45:29.477710196 -0400
> @@ -957,8 +957,9 @@ ext2_writepages(struct address_space *ma
>  static int
>  ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
> -       return dax_writeback_mapping_range(mapping,
> -                       mapping->host->i_sb->s_bdev, wbc);
> +       struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
> +
> +       return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>  }
>
>  const struct address_space_operations ext2_aops = {


* Re: [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-21 17:57 ` [PATCH 04/19] virtio: Implement get_shm_region for PCI transport Vivek Goyal
  2019-08-26  1:43   ` [Virtio-fs] " piaojun
@ 2019-08-27  8:34   ` Cornelia Huck
  2019-08-27  8:46     ` Cornelia Huck
  2019-08-27 11:53     ` Vivek Goyal
  1 sibling, 2 replies; 77+ messages in thread
From: Cornelia Huck @ 2019-08-27  8:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Sebastien Boeuf, kvm, kbuild test robot

On Wed, 21 Aug 2019 13:57:05 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> On PCI the shm regions are found using capability entries;
> find a region by searching for the capability.
> 
> Cc: kvm@vger.kernel.org
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: kbuild test robot <lkp@intel.com>

An s-o-b by a test robot looks a bit odd.

> ---
>  drivers/virtio/virtio_pci_modern.c | 108 +++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_pci.h    |  11 ++-
>  2 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 7abcc50838b8..1cdedd93f42a 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -443,6 +443,112 @@ static void del_vq(struct virtio_pci_vq_info *info)
>  	vring_del_virtqueue(vq);
>  }
>  
> +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> +                                   u8 required_id,
> +                                   u8 *bar, u64 *offset, u64 *len)
> +{
> +	int pos;
> +
> +        for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);

Indentation looks a bit off here.

> +             pos > 0;
> +             pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> +		u8 type, cap_len, id;
> +                u32 tmp32;

Here as well.

> +                u64 res_offset, res_length;
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         cfg_type),
> +                                     &type);
> +                if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)

And here.

> +                        continue;
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         cap_len),
> +                                     &cap_len);
> +		if (cap_len != sizeof(struct virtio_pci_cap64)) {
> +		        printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
> +                               __func__, pos, cap_len);

Probably better to use dev_warn() instead of printk.

> +                        continue;
> +                }

Indentation looks off again (might be a space vs tabs issue; maybe
check the whole patch for indentation problems?)

> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         id),
> +                                     &id);
> +                if (id != required_id)
> +                        continue;
> +
> +                /* Type, and ID match, looks good */
> +                pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         bar),
> +                                     bar);
> +
> +                /* Read the lower 32bit of length and offset */
> +                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
> +                                      &tmp32);
> +                res_offset = tmp32;
> +                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
> +                                      &tmp32);
> +                res_length = tmp32;
> +
> +                /* and now the top half */
> +                pci_read_config_dword(dev,
> +                                      pos + offsetof(struct virtio_pci_cap64,
> +                                                     offset_hi),
> +                                      &tmp32);
> +                res_offset |= ((u64)tmp32) << 32;
> +                pci_read_config_dword(dev,
> +                                      pos + offsetof(struct virtio_pci_cap64,
> +                                                     length_hi),
> +                                      &tmp32);
> +                res_length |= ((u64)tmp32) << 32;
> +
> +                *offset = res_offset;
> +                *len = res_length;
> +
> +                return pos;
> +        }
> +        return 0;
> +}
> +
> +static bool vp_get_shm_region(struct virtio_device *vdev,
> +			      struct virtio_shm_region *region, u8 id)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);

This whole function looks like it is indented incorrectly.

> +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> +	u8 bar;
> +	u64 offset, len;
> +	phys_addr_t phys_addr;
> +	size_t bar_len;
> +	char *bar_name;
> +	int ret;
> +
> +	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> +		return false;
> +	}

You can drop the curly braces.

> +
> +	ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> +	if (ret < 0) {
> +		dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> +			__func__);
> +		return false;
> +	}
> +
> +	phys_addr = pci_resource_start(pci_dev, bar);
> +	bar_len = pci_resource_len(pci_dev, bar);
> +
> +        if (offset + len > bar_len) {
> +                dev_err(&pci_dev->dev,
> +                        "%s: bar shorter than cap offset+len\n",
> +                        __func__);
> +                return false;
> +        }
> +
> +	region->len = len;
> +	region->addr = (u64) phys_addr + offset;
> +
> +	return true;
> +}
> +
>  static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
>  	.get		= NULL,
>  	.set		= NULL,

Apart from the coding style nits, the logic of the patch looks sane to
me.

(...)

As does the rest of the patch.
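
As a standalone illustration of the two pieces of arithmetic in the
quoted patch (joining the 32-bit halves of offset and length, and
checking the window against the BAR size), here is a minimal userspace
sketch.  The helper names shm_join_halves() and shm_range_fits() are
made up for this example and are not kernel APIs.  The range check is
written in a form that cannot wrap around; that is a variation on the
patch's "offset + len > bar_len" test, not what the patch itself does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Join the low and high 32-bit halves read from config space. */
uint64_t shm_join_halves(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/*
 * True when [offset, offset + len) lies inside a BAR of bar_len bytes.
 * Comparing len against the remaining space avoids computing
 * offset + len, which could overflow a u64.
 */
bool shm_range_fits(uint64_t offset, uint64_t len, uint64_t bar_len)
{
	return offset <= bar_len && len <= bar_len - offset;
}
```

A capability whose offset is close to the top of the 64-bit range is
rejected by this form, whereas a plain sum could wrap and pass the
comparison.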


* Re: [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport
  2019-08-21 17:57 ` [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
@ 2019-08-27  8:39   ` Cornelia Huck
  2019-08-27 11:54     ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Cornelia Huck @ 2019-08-27  8:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Sebastien Boeuf, kvm

On Wed, 21 Aug 2019 13:57:06 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> On MMIO a new set of registers is defined for finding SHM
> regions.  Add their definitions and use them to find the region.
> 
> Cc: kvm@vger.kernel.org
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> ---
>  drivers/virtio/virtio_mmio.c     | 32 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_mmio.h | 11 +++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index e09edb5c5e06..5c07985c8cb8 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -500,6 +500,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
>  	return vm_dev->pdev->name;
>  }
>  
> +static bool vm_get_shm_region(struct virtio_device *vdev,
> +			      struct virtio_shm_region *region, u8 id)
> +{
> +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> +	u64 len, addr;
> +
> +	/* Select the region we're interested in */
> +	writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
> +
> +	/* Read the region size */
> +	len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
> +	len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
> +
> +	region->len = len;
> +
> +	/* Check if region length is -1. If that's the case, the shared memory
> +	 * region does not exist and there is no need to proceed further.
> +	 */
> +	if (len == ~(u64)0) {
> +		return false;
> +	}

I think the curly braces should be dropped here.

> +
> +	/* Read the region base address */
> +	addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
> +	addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
> +
> +	region->addr = addr;
> +
> +	return true;
> +}
> +
>  static const struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get		= vm_get,
>  	.set		= vm_set,

Reviewed-by: Cornelia Huck <cohuck@redhat.com>
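
For reference, the register handling in the quoted patch can be sketched
in userspace as follows. This is a minimal sketch under stated
assumptions, not the driver code: struct shm_region and
shm_region_from_regs() are hypothetical names, and the four 32-bit
register values are passed in directly instead of being read with
readl(). Unlike the quoted patch, the sketch leaves *region untouched
when the region does not exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct virtio_shm_region. */
struct shm_region {
	uint64_t addr;
	uint64_t len;
};

/* Combine the low and high 32-bit halves into one 64-bit value. */
static inline uint64_t combine_lo_hi(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/*
 * Fill *region from four 32-bit register values. Returns false when the
 * length reads back as all ones, which the quoted patch treats as
 * "this shared memory region does not exist".
 */
static inline bool shm_region_from_regs(uint32_t len_lo, uint32_t len_hi,
					uint32_t base_lo, uint32_t base_hi,
					struct shm_region *region)
{
	uint64_t len;

	assert(region != NULL);

	len = combine_lo_hi(len_lo, len_hi);
	if (len == ~(uint64_t)0)
		return false;

	region->len = len;
	region->addr = combine_lo_hi(base_lo, base_hi);
	return true;
}
```

A caller would pass the values read from the LEN_LOW/LEN_HIGH and
BASE_LOW/BASE_HIGH registers after selecting the region id.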

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-27  8:34   ` Cornelia Huck
@ 2019-08-27  8:46     ` Cornelia Huck
  2019-08-27 11:53     ` Vivek Goyal
  1 sibling, 0 replies; 77+ messages in thread
From: Cornelia Huck @ 2019-08-27  8:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Sebastien Boeuf, kvm, kbuild test robot

On Tue, 27 Aug 2019 10:34:57 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 21 Aug 2019 13:57:05 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:

> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > +			      struct virtio_shm_region *region, u8 id)
> > +{
> > +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);  
> 
> This whole function looks like it is indented incorrectly.

Hmpf, it looks like my mail client is squashing tabs, so the
indentation looks off here, but is probably fine :) It's the function
above that seems to have a mix of spaces and tabs.

> 
> > +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> > +	u8 bar;
> > +	u64 offset, len;
> > +	phys_addr_t phys_addr;
> > +	size_t bar_len;
> > +	char *bar_name;
> > +	int ret;

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [Virtio-fs] [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-26 13:06     ` Vivek Goyal
@ 2019-08-27  9:41       ` piaojun
  0 siblings, 0 replies; 77+ messages in thread
From: piaojun @ 2019-08-27  9:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, kbuild test robot,
	kvm, miklos, virtio-fs, Sebastien Boeuf



On 2019/8/26 21:06, Vivek Goyal wrote:
> On Mon, Aug 26, 2019 at 09:43:08AM +0800, piaojun wrote:
> 
> [..]
>>> +static bool vp_get_shm_region(struct virtio_device *vdev,
>>> +			      struct virtio_shm_region *region, u8 id)
>>> +{
>>> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>> +	struct pci_dev *pci_dev = vp_dev->pci_dev;
>>> +	u8 bar;
>>> +	u64 offset, len;
>>> +	phys_addr_t phys_addr;
>>> +	size_t bar_len;
>>> +	char *bar_name;
>>
>> 'char *bar_name' should be removed to avoid a compiler warning. Also, I
>> wonder whether you mixed tabs and spaces for indentation, or is it just
>> my email display problem?
> 
> Will get rid of now unused bar_name. 
> 
OK

> Generally git flags if there are tab/space issues. I did not see any. So
> if you see something, point it out and I will fix it.
> 

cohuck found the same indentation problem and pointed it out in another email.

Jun

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/19] virtio: Implement get_shm_region for PCI transport
  2019-08-27  8:34   ` Cornelia Huck
  2019-08-27  8:46     ` Cornelia Huck
@ 2019-08-27 11:53     ` Vivek Goyal
  1 sibling, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-27 11:53 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Sebastien Boeuf, kvm, kbuild test robot

On Tue, Aug 27, 2019 at 10:34:57AM +0200, Cornelia Huck wrote:
> On Wed, 21 Aug 2019 13:57:05 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > 
> > On PCI the shm regions are found using capability entries;
> > find a region by searching for the capability.
> > 
> > Cc: kvm@vger.kernel.org
> > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Signed-off-by: kbuild test robot <lkp@intel.com>
> 
> An s-o-b by a test robot looks a bit odd.

I think one of the fixes came from the robot and that's why I put s-o-b
from the robot as well. 

I will review the whole patch and fix all the indentation issues.

Vivek

> 
> > ---
> >  drivers/virtio/virtio_pci_modern.c | 108 +++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_pci.h    |  11 ++-
> >  2 files changed, 118 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > index 7abcc50838b8..1cdedd93f42a 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -443,6 +443,112 @@ static void del_vq(struct virtio_pci_vq_info *info)
> >  	vring_del_virtqueue(vq);
> >  }
> >  
> > +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> > +                                   u8 required_id,
> > +                                   u8 *bar, u64 *offset, u64 *len)
> > +{
> > +	int pos;
> > +
> > +        for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> 
> Indentation looks a bit off here.
> 
> > +             pos > 0;
> > +             pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> > +		u8 type, cap_len, id;
> > +                u32 tmp32;
> 
> Here as well.
> 
> > +                u64 res_offset, res_length;
> > +
> > +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> > +                                                         cfg_type),
> > +                                     &type);
> > +                if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
> 
> And here.
> 
> > +                        continue;
> > +
> > +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> > +                                                         cap_len),
> > +                                     &cap_len);
> > +		if (cap_len != sizeof(struct virtio_pci_cap64)) {
> > +		        printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
> > +                               __func__, pos, cap_len);
> 
> Probably better to use dev_warn() instead of printk.
> 
> > +                        continue;
> > +                }
> 
> Indentation looks off again (might be a space vs tabs issue; maybe
> check the whole patch for indentation problems?)
> 
> > +
> > +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> > +                                                         id),
> > +                                     &id);
> > +                if (id != required_id)
> > +                        continue;
> > +
> > +                /* Type, and ID match, looks good */
> > +                pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> > +                                                         bar),
> > +                                     bar);
> > +
> > +                /* Read the lower 32bit of length and offset */
> > +                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
> > +                                      &tmp32);
> > +                res_offset = tmp32;
> > +                pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
> > +                                      &tmp32);
> > +                res_length = tmp32;
> > +
> > +                /* and now the top half */
> > +                pci_read_config_dword(dev,
> > +                                      pos + offsetof(struct virtio_pci_cap64,
> > +                                                     offset_hi),
> > +                                      &tmp32);
> > +                res_offset |= ((u64)tmp32) << 32;
> > +                pci_read_config_dword(dev,
> > +                                      pos + offsetof(struct virtio_pci_cap64,
> > +                                                     length_hi),
> > +                                      &tmp32);
> > +                res_length |= ((u64)tmp32) << 32;
> > +
> > +                *offset = res_offset;
> > +                *len = res_length;
> > +
> > +                return pos;
> > +        }
> > +        return 0;
> > +}
> > +
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > +			      struct virtio_shm_region *region, u8 id)
> > +{
> > +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> 
> This whole function looks like it is indented incorrectly.
> 
> > +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> > +	u8 bar;
> > +	u64 offset, len;
> > +	phys_addr_t phys_addr;
> > +	size_t bar_len;
> > +	char *bar_name;
> > +	int ret;
> > +
> > +	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > +		return false;
> > +	}
> 
> You can drop the curly braces.
> 
> > +
> > +	ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > +	if (ret < 0) {
> > +		dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > +			__func__);
> > +		return false;
> > +	}
> > +
> > +	phys_addr = pci_resource_start(pci_dev, bar);
> > +	bar_len = pci_resource_len(pci_dev, bar);
> > +
> > +        if (offset + len > bar_len) {
> > +                dev_err(&pci_dev->dev,
> > +                        "%s: bar shorter than cap offset+len\n",
> > +                        __func__);
> > +                return false;
> > +        }
> > +
> > +	region->len = len;
> > +	region->addr = (u64) phys_addr + offset;
> > +
> > +	return true;
> > +}
> > +
> >  static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
> >  	.get		= NULL,
> >  	.set		= NULL,
> 
> Apart from the coding style nits, the logic of the patch looks sane to
> me.
> 
> (...)
> 
> As does the rest of the patch.
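
A side note on the quoted bounds check "offset + len > bar_len": both
values are 64-bit and come from config space, so the sum could in
principle wrap around. The following is a minimal sketch of a wrap-safe
form of the same test. It is not something proposed in this thread, and
shm_cap_fits_bar() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the range [offset, offset + len) lies inside a BAR of
 * bar_len bytes. The sum offset + len is never computed, so a very large
 * offset or len cannot wrap around and pass the check by accident.
 */
static inline bool shm_cap_fits_bar(uint64_t offset, uint64_t len,
				    uint64_t bar_len)
{
	if (offset > bar_len)
		return false;
	return len <= bar_len - offset;
}
```

With offset = UINT64_MAX - 1 and len = 4, the naive sum wraps to 2 and
would pass a "sum > bar_len" test; the form above rejects it.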

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport
  2019-08-27  8:39   ` Cornelia Huck
@ 2019-08-27 11:54     ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2019-08-27 11:54 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Sebastien Boeuf, kvm

On Tue, Aug 27, 2019 at 10:39:43AM +0200, Cornelia Huck wrote:
> On Wed, 21 Aug 2019 13:57:06 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > 
> > On MMIO a new set of registers is defined for finding SHM
> > regions.  Add their definitions and use them to find the region.
> > 
> > Cc: kvm@vger.kernel.org
> > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > ---
> >  drivers/virtio/virtio_mmio.c     | 32 ++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_mmio.h | 11 +++++++++++
> >  2 files changed, 43 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> > index e09edb5c5e06..5c07985c8cb8 100644
> > --- a/drivers/virtio/virtio_mmio.c
> > +++ b/drivers/virtio/virtio_mmio.c
> > @@ -500,6 +500,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
> >  	return vm_dev->pdev->name;
> >  }
> >  
> > +static bool vm_get_shm_region(struct virtio_device *vdev,
> > +			      struct virtio_shm_region *region, u8 id)
> > +{
> > +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> > +	u64 len, addr;
> > +
> > +	/* Select the region we're interested in */
> > +	writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
> > +
> > +	/* Read the region size */
> > +	len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
> > +	len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
> > +
> > +	region->len = len;
> > +
> > +	/* Check if region length is -1. If that's the case, the shared memory
> > +	 * region does not exist and there is no need to proceed further.
> > +	 */
> > +	if (len == ~(u64)0) {
> > +		return false;
> > +	}
> 
> I think the curly braces should be dropped here.

Will do.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-26 20:33     ` Vivek Goyal
  2019-08-26 20:58       ` Vivek Goyal
@ 2019-08-27 13:45       ` Jan Kara
  1 sibling, 0 replies; 77+ messages in thread
From: Jan Kara @ 2019-08-27 13:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-nvdimm,
	virtio-fs, miklos, stefanha, dgilbert, Dan Williams

On Mon 26-08-19 16:33:26, Vivek Goyal wrote:
> On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> > On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > > is searched from that bdev name.
> > > 
> > > virtio-fs does not have a bdev. So pass in dax_dev also to
> > > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > > used otherwise dax_dev is searched using bdev.
> > 
> > Please just pass in only the dax_device and get rid of the block device.
> > The callers should have one at hand easily, e.g. for XFS just call
> > xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.
> 
> Sure. Here is the updated patch.
> 
> This patch can probably go upstream independently. If you are fine with
> the patch, I can post it separately for inclusion.
> 
> 
> Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
> 
> As of now dax_writeback_mapping_range() takes "struct block_device" as a
> parameter and dax_dev is searched from bdev name. This also involves taking
> a fresh reference on dax_dev and putting that reference at the end of
> function.
> 
> We are developing a new filesystem virtio-fs and using dax to access host
> page cache directly. But there is no block device. IOW, we want to make
> use of dax but want to get rid of this assumption that there is always
> a block device associated with dax_dev.
> 
> So pass in "struct dax_device" as parameter instead of bdev.
> 
> ext2/ext4/xfs are current users and they already have a reference on
> dax_device. So there is no need to take reference and drop reference to
> dax_device on each call of this function.
> 
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/dax.c            |    8 +-------
>  fs/ext2/inode.c     |    5 +++--
>  fs/ext4/inode.c     |    2 +-
>  fs/xfs/xfs_aops.c   |    2 +-
>  include/linux/dax.h |    2 +-
>  5 files changed, 7 insertions(+), 12 deletions(-)
> 
> Index: rhvgoyal-linux-fuse/fs/dax.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/dax.c	2019-08-26 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/fs/dax.c	2019-08-26 11:24:43.973009968 -0400
> @@ -936,12 +936,11 @@ static int dax_writeback_one(struct xa_s
>   * on persistent storage prior to completion of the operation.
>   */
>  int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct block_device *bdev, struct writeback_control *wbc)
> +		struct dax_device *dax_dev, struct writeback_control *wbc)
>  {
>  	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
>  	struct inode *inode = mapping->host;
>  	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
> -	struct dax_device *dax_dev;
>  	void *entry;
>  	int ret = 0;
>  	unsigned int scanned = 0;
> @@ -952,10 +951,6 @@ int dax_writeback_mapping_range(struct a
>  	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
>  		return 0;
>  
> -	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> -	if (!dax_dev)
> -		return -EIO;
> -
>  	trace_dax_writeback_range(inode, xas.xa_index, end_index);
>  
>  	tag_pages_for_writeback(mapping, xas.xa_index, end_index);
> @@ -976,7 +971,6 @@ int dax_writeback_mapping_range(struct a
>  		xas_lock_irq(&xas);
>  	}
>  	xas_unlock_irq(&xas);
> -	put_dax(dax_dev);
>  	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
>  	return ret;
>  }
> Index: rhvgoyal-linux-fuse/include/linux/dax.h
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/include/linux/dax.h	2019-08-26 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/include/linux/dax.h	2019-08-26 11:26:08.384009968 -0400
> @@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
>  
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct block_device *bdev, struct writeback_control *wbc);
> +		struct dax_device *dax_dev, struct writeback_control *wbc);
>  
>  struct page *dax_layout_busy_page(struct address_space *mapping);
>  dax_entry_t dax_lock_page(struct page *page);
> Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/xfs/xfs_aops.c	2019-08-26 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c	2019-08-26 11:34:51.085009968 -0400
> @@ -1120,7 +1120,7 @@ xfs_dax_writepages(
>  {
>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>  	return dax_writeback_mapping_range(mapping,
> -			xfs_find_bdev_for_inode(mapping->host), wbc);
> +			xfs_find_daxdev_for_inode(mapping->host), wbc);
>  }
>  
>  STATIC int
> Index: rhvgoyal-linux-fuse/fs/ext4/inode.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/ext4/inode.c	2019-08-26 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/fs/ext4/inode.c	2019-08-26 11:39:56.828009968 -0400
> @@ -2992,7 +2992,7 @@ static int ext4_dax_writepages(struct ad
>  	percpu_down_read(&sbi->s_journal_flag_rwsem);
>  	trace_ext4_writepages(inode, wbc);
>  
> -	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
> +	ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>  	trace_ext4_writepages_result(inode, wbc, ret,
>  				     nr_to_write - wbc->nr_to_write);
>  	percpu_up_read(&sbi->s_journal_flag_rwsem);
> Index: rhvgoyal-linux-fuse/fs/ext2/inode.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/ext2/inode.c	2019-08-26 11:20:36.545009968 -0400
> +++ rhvgoyal-linux-fuse/fs/ext2/inode.c	2019-08-26 11:43:04.842009968 -0400
> @@ -957,8 +957,9 @@ ext2_writepages(struct address_space *ma
>  static int
>  ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
> -	return dax_writeback_mapping_range(mapping,
> -			mapping->host->i_sb->s_bdev, wbc);
> +	struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
> +
> +	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>  }
>  
>  const struct address_space_operations ext2_aops = {
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-26 11:51   ` Christoph Hellwig
@ 2019-08-27 16:38     ` Vivek Goyal
  2019-08-28  6:58       ` Christoph Hellwig
  0 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-27 16:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Dan Williams

On Mon, Aug 26, 2019 at 04:51:52AM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 01:57:02PM -0400, Vivek Goyal wrote:
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Although struct dax_device itself is not tied to a block device, some
> > DAX code assumes there is a block device.  Make block devices optional
> > by allowing bdev to be NULL in commonly used DAX APIs.
> > 
> > When there is no block device:
> >  * Skip the partition offset calculation in bdev_dax_pgoff()
> >  * Skip the blkdev_issue_zeroout() optimization
> > 
> > > Note that more block device assumptions remain but I haven't reached those
> > > code paths yet.
> 
> I think this should be split into two patches.

Hi Christoph,

OK, I will split this into two patches. In fact, I think I will
completely drop the second change for now, as we might not be hitting
that path yet.

> For bdev_dax_pgoff
> I'd much rather have the partition offset if there is one in the daxdev
> somehow so that we can get rid of the block device entirely.

IIUC, there is one block_device per partition while there is only one
dax_device for the whole disk. So we can't directly move bdev logical
offset into dax_device.

We could probably put this in "iomap" and leave it to filesystems to
report the offset into dax_dev in iomap; that way generic dax code does
not have to deal with it. But that will probably be a bigger change.

Did I misunderstand your suggestion?

> 
> Similarly for dax_range_is_aligned I'd rather have a pure dax way
> to offload zeroing rather than this bdev hack.

The following commit introduced the change to write zeros through the
block device path.

commit 4b0228fa1d753f77fe0e6cf4c41398ec77dfbd2a
Author: Vishal Verma <vishal.l.verma@intel.com>
Date:   Thu Apr 21 15:13:46 2016 -0400

 dax: for truncate/hole-punch, do zeroing through the driver if possible

IIUC, they are doing it so that they can clear the gendisk->badblocks
list.

So even if there is a pure dax way to do it, there will have to be some
involvement of the block layer to clear the gendisk->badblocks list.

I am not sure I fully understand your suggestion. But I am hoping it's
not a must for these changes to make progress. For now, I will drop the
change to dax_range_is_aligned().

Thanks
Vivek

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-27 16:38     ` Vivek Goyal
@ 2019-08-28  6:58       ` Christoph Hellwig
  2019-08-28 17:58         ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Hellwig @ 2019-08-28  6:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, miklos, linux-nvdimm, linux-kernel, dgilbert,
	virtio-fs, stefanha, linux-fsdevel

On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > For bdev_dax_pgoff
> > I'd much rather have the partition offset if there is one in the daxdev
> > somehow so that we can get rid of the block device entirely.
> 
> IIUC, there is one block_device per partition while there is only one
> dax_device for the whole disk. So we can't directly move bdev logical
> offset into dax_device.

Well, then we need to find a way to get partitions for dax devices,
as we really should not expect a block device hiding behind a dax
dev.  That is just a weird legacy assumption - block devices need to
layer on top of the dax device optionally.

> 
> We probably could put this in "iomap" and leave it to filesystems to
> report offset into dax_dev in iomap that way dax generic code does not
> have to deal with it. But that probably will be a bigger change.

And where would the file system get that information from?

> commit 4b0228fa1d753f77fe0e6cf4c41398ec77dfbd2a
> Author: Vishal Verma <vishal.l.verma@intel.com>
> Date:   Thu Apr 21 15:13:46 2016 -0400
> 
>  dax: for truncate/hole-punch, do zeroing through the driver if possible
> 
> IIUC, they are doing it so that they can clear gendisk->badblocks list.
> 
> So even if there is a pure dax way to do it, there will have to be some
> involvement of the block layer to clear the gendisk->badblocks list.

Once again we need to move that list to the dax device, as the
assumption that there is a block device associated with the dax dev
is flawed.

> I am not sure I fully understand your suggestion. But I am hoping it's
> not a must for these changes to make progress. For now, I will drop the
> change to dax_range_is_aligned().

Well, someone needs to clean this mess up, and as you have an actual
real life example of a dax dev without the block device I think the
burden naturally falls on you.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-26 20:58       ` Vivek Goyal
  2019-08-26 21:33         ` Dan Williams
@ 2019-08-28  6:58         ` Christoph Hellwig
  2020-01-03 14:12         ` Vivek Goyal
  2 siblings, 0 replies; 77+ messages in thread
From: Christoph Hellwig @ 2019-08-28  6:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, miklos, linux-nvdimm, linux-kernel, dgilbert,
	virtio-fs, stefanha, linux-fsdevel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-28  6:58       ` Christoph Hellwig
@ 2019-08-28 17:58         ` Vivek Goyal
  2019-08-28 22:53           ` Dave Chinner
  0 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-08-28 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: miklos, linux-nvdimm, linux-kernel, dgilbert, virtio-fs,
	stefanha, linux-fsdevel, Dan Williams

On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > For bdev_dax_pgoff
> > > I'd much rather have the partition offset if there is one in the daxdev
> > > somehow so that we can get rid of the block device entirely.
> > 
> > IIUC, there is one block_device per partition while there is only one
> > dax_device for the whole disk. So we can't directly move bdev logical
> > offset into dax_device.
> 
> Well, then we need to find a way to get partitions for dax devices,
> as we really should not expect a block device hiding behind a dax
> dev.  That is just a weird legacy assumption - block device need to
> layer on top of the dax device optionally.
> 
> > 
> > We probably could put this in "iomap" and leave it to filesystems to
> > report offset into dax_dev in iomap that way dax generic code does not
> > have to deal with it. But that probably will be a bigger change.
> 
> And where would the file system get that information from?

The file system knows about the block device; can it just call
get_start_sect() while filling iomap->addr? That means we don't have to
keep partition information in the dax device. Would something like the
following work? (Just a proof-of-concept patch.)


---
 drivers/dax/super.c |   11 +++++++++++
 fs/dax.c            |    6 +++---
 fs/ext4/inode.c     |    6 +++++-
 include/linux/dax.h |    1 +
 4 files changed, 20 insertions(+), 4 deletions(-)

Index: rhvgoyal-linux/fs/ext4/inode.c
===================================================================
--- rhvgoyal-linux.orig/fs/ext4/inode.c	2019-08-28 13:51:16.051937204 -0400
+++ rhvgoyal-linux/fs/ext4/inode.c	2019-08-28 13:51:44.453937204 -0400
@@ -3589,7 +3589,11 @@ retry:
 			WARN_ON_ONCE(1);
 			return -EIO;
 		}
-		iomap->addr = (u64)map.m_pblk << blkbits;
+		if (IS_DAX(inode))
+			iomap->addr = ((u64)map.m_pblk << blkbits) +
+				      (get_start_sect(iomap->bdev) * 512);
+		else
+			iomap->addr = (u64)map.m_pblk << blkbits;
 	}
 
 	if (map.m_flags & EXT4_MAP_NEW)
Index: rhvgoyal-linux/fs/dax.c
===================================================================
--- rhvgoyal-linux.orig/fs/dax.c	2019-08-28 13:51:16.051937204 -0400
+++ rhvgoyal-linux/fs/dax.c	2019-08-28 13:51:44.457937204 -0400
@@ -688,7 +688,7 @@ static int copy_user_dax(struct block_de
 	long rc;
 	int id;
 
-	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	rc = dax_pgoff(sector, size, &pgoff);
 	if (rc)
 		return rc;
 
@@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *i
 	int id, rc;
 	long length;
 
-	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+	rc = dax_pgoff(sector, size, &pgoff);
 	if (rc)
 		return rc;
 	id = dax_read_lock();
@@ -1137,7 +1137,7 @@ dax_iomap_actor(struct inode *inode, lof
 			break;
 		}
 
-		ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+		ret = dax_pgoff(sector, size, &pgoff);
 		if (ret)
 			break;
 
Index: rhvgoyal-linux/drivers/dax/super.c
===================================================================
--- rhvgoyal-linux.orig/drivers/dax/super.c	2019-08-28 13:51:51.802937204 -0400
+++ rhvgoyal-linux/drivers/dax/super.c	2019-08-28 13:51:56.905937204 -0400
@@ -43,6 +43,17 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 #ifdef CONFIG_BLOCK
 #include <linux/blkdev.h>
 
+int dax_pgoff(sector_t sector, size_t size, pgoff_t *pgoff)
+{
+	phys_addr_t phys_off = sector * 512;
+
+	if (pgoff)
+		*pgoff = PHYS_PFN(phys_off);
+	if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
+		return -EINVAL;
+	return 0;
+}
+
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
Index: rhvgoyal-linux/include/linux/dax.h
===================================================================
--- rhvgoyal-linux.orig/include/linux/dax.h	2019-08-28 13:51:51.802937204 -0400
+++ rhvgoyal-linux/include/linux/dax.h	2019-08-28 13:51:56.908937204 -0400
@@ -111,6 +111,7 @@ static inline bool daxdev_mapping_suppor
 
 struct writeback_control;
 int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
+int dax_pgoff(sector_t, size_t, pgoff_t *pgoff);
 #if IS_ENABLED(CONFIG_FS_DAX)
 bool __bdev_dax_supported(struct block_device *bdev, int blocksize);
 static inline bool bdev_dax_supported(struct block_device *bdev, int blocksize)
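
The arithmetic behind the proof of concept above can be checked in
userspace. The following is a minimal sketch under stated assumptions:
the sketch_* names are hypothetical, a 4 KiB page size is assumed, -1
stands in for -EINVAL, and sketch_bdev_dax_pgoff() reflects my reading
of the existing bdev_dax_pgoff(), which adds the partition start before
converting to a page offset.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u
#define SKETCH_PAGE_SIZE   4096u	/* assumption: 4 KiB pages */

/*
 * Mirrors the proposed dax_pgoff(): convert a sector that is already
 * relative to the start of the dax device into a page offset. Returns -1
 * when the byte offset or the size is not page aligned.
 */
static inline int sketch_dax_pgoff(uint64_t sector, uint64_t size,
				   uint64_t *pgoff)
{
	uint64_t phys_off = sector * SKETCH_SECTOR_SIZE;

	if (pgoff)
		*pgoff = phys_off / SKETCH_PAGE_SIZE;
	if (phys_off % SKETCH_PAGE_SIZE || size % SKETCH_PAGE_SIZE)
		return -1;
	return 0;
}

/*
 * Mirrors the existing scheme: the sector is relative to the partition,
 * so the partition start is added before the conversion.
 */
static inline int sketch_bdev_dax_pgoff(uint64_t part_start_sect,
					uint64_t sector, uint64_t size,
					uint64_t *pgoff)
{
	return sketch_dax_pgoff(part_start_sect + sector, size, pgoff);
}
```

For a partition starting at sector 2048 and a filesystem byte address of
8192, both schemes should yield the same page offset: the existing one
adds 2048 to sector 16, while the proposed one expects the filesystem to
have already added 2048 * 512 bytes to iomap->addr.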

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-28 17:58         ` Vivek Goyal
@ 2019-08-28 22:53           ` Dave Chinner
  2019-08-29  0:04             ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2019-08-28 22:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, miklos, linux-nvdimm, linux-kernel, dgilbert,
	virtio-fs, stefanha, linux-fsdevel, Dan Williams

On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > For bdev_dax_pgoff
> > > > I'd much rather have the partition offset if there is one in the daxdev
> > > > somehow so that we can get rid of the block device entirely.
> > > 
> > > IIUC, there is one block_device per partition while there is only one
> > > dax_device for the whole disk. So we can't directly move bdev logical
> > > offset into dax_device.
> > 
> > Well, then we need to find a way to get partitions for dax devices,
> > as we really should not expect a block device hiding behind a dax
> > dev.  That is just a weird legacy assumption - block device need to
> > layer on top of the dax device optionally.
> > 
> > > 
> > > We probably could put this in "iomap" and leave it to filesystems to
> > > report offset into dax_dev in iomap that way dax generic code does not
> > > have to deal with it. But that probably will be a bigger change.
> > 
> > And where would the file system get that information from?
> 
> File system knows about block device, can it just call get_start_sect()
> while filling iomap->addr. And this means we don't have to have
> partition information in dax device. Will something like following work?
> (Just a proof of concept patch).
> 
> 
> ---
>  drivers/dax/super.c |   11 +++++++++++
>  fs/dax.c            |    6 +++---
>  fs/ext4/inode.c     |    6 +++++-
>  include/linux/dax.h |    1 +
>  4 files changed, 20 insertions(+), 4 deletions(-)
> 
> Index: rhvgoyal-linux/fs/ext4/inode.c
> ===================================================================
> --- rhvgoyal-linux.orig/fs/ext4/inode.c	2019-08-28 13:51:16.051937204 -0400
> +++ rhvgoyal-linux/fs/ext4/inode.c	2019-08-28 13:51:44.453937204 -0400
> @@ -3589,7 +3589,11 @@ retry:
>  			WARN_ON_ONCE(1);
>  			return -EIO;
>  		}
> -		iomap->addr = (u64)map.m_pblk << blkbits;
> +		if (IS_DAX(inode))
> +			iomap->addr = ((u64)map.m_pblk << blkbits) +
> +				      (get_start_sect(iomap->bdev) * 512);
> +		else
> +			iomap->addr = (u64)map.m_pblk << blkbits;

I'm not a fan of returning a physical device sector address from an
interface where every other user/caller expects this address to be a
logical block address into the block device. It creates a landmine
in the iomap API that callers may not be aware of and that's going
to cause bugs. We're trying really hard to keep special case hacks
like this out of the iomap infrastructure, so on those grounds alone
I'd suggest this is a dead end approach.

Hence I think that if the dax device needs a physical offset from
the start of the block device the filesystem sits on, it should be
set up at dax device instantiation time and so the filesystem/bdev
never needs to be queried again for this information.
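
To make the concern concrete, here is a minimal sketch (plain userspace C,
not kernel code) of the two address computations in the proof-of-concept
hunk. The helper names logical_addr() and physical_addr() are made up for
illustration; only the arithmetic comes from the patch, and it assumes
get_start_sect() reports 512-byte sectors, as the "* 512" in the hunk
implies.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u	/* assumed unit of get_start_sect() */

/*
 * Partition-relative byte address: what the non-DAX branch of the
 * hunk stores in iomap->addr.
 */
static uint64_t logical_addr(uint64_t pblk, unsigned int blkbits)
{
	assert(blkbits < 64);	/* a shift by >= 64 would be undefined */
	return pblk << blkbits;
}

/*
 * Device-relative byte address: what the IS_DAX() branch of the hunk
 * stores, i.e. the logical address plus the partition start offset.
 */
static uint64_t physical_addr(uint64_t pblk, unsigned int blkbits,
			      uint64_t start_sect)
{
	return logical_addr(pblk, blkbits) + start_sect * SECTOR_SIZE;
}
```

For block 100 with 4096-byte blocks on a partition starting at sector 2048,
logical_addr() gives 409600 while physical_addr() gives 1458176. The same
field would carry either value depending on IS_DAX(), which is the landmine
described above.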

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-28 22:53           ` Dave Chinner
@ 2019-08-29  0:04             ` Dan Williams
  2019-08-29  9:32               ` Christoph Hellwig
  2019-12-16 18:10               ` Vivek Goyal
  0 siblings, 2 replies; 77+ messages in thread
From: Dan Williams @ 2019-08-29  0:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vivek Goyal, Christoph Hellwig, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Wed, Aug 28, 2019 at 3:53 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > > For bdev_dax_pgoff
> > > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > > somehow so that we can get rid of the block device entirely.
> > > >
> > > > IIUC, there is one block_device per partition while there is only one
> > > > dax_device for the whole disk. So we can't directly move bdev logical
> > > > offset into dax_device.
> > >
> > > Well, then we need to find a way to get partitions for dax devices,
> > > as we really should not expect a block device hiding behind a dax
> > > dev.  That is just a weird legacy assumption - block device need to
> > > layer on top of the dax device optionally.
> > >
> > > >
> > > > We probably could put this in "iomap" and leave it to filesystems to
> > > > report offset into dax_dev in iomap that way dax generic code does not
> > > > have to deal with it. But that probably will be a bigger change.
> > >
> > > And where would the file system get that information from?
> >
> > File system knows about block device, can it just call get_start_sect()
> > while filling iomap->addr. And this means we don't have to have
> > parition information in dax device. Will something like following work?
> > (Just a proof of concept patch).
> >
> >
> > ---
> >  drivers/dax/super.c |   11 +++++++++++
> >  fs/dax.c            |    6 +++---
> >  fs/ext4/inode.c     |    6 +++++-
> >  include/linux/dax.h |    1 +
> >  4 files changed, 20 insertions(+), 4 deletions(-)
> >
> > Index: rhvgoyal-linux/fs/ext4/inode.c
> > ===================================================================
> > --- rhvgoyal-linux.orig/fs/ext4/inode.c       2019-08-28 13:51:16.051937204 -0400
> > +++ rhvgoyal-linux/fs/ext4/inode.c    2019-08-28 13:51:44.453937204 -0400
> > @@ -3589,7 +3589,11 @@ retry:
> >                       WARN_ON_ONCE(1);
> >                       return -EIO;
> >               }
> > -             iomap->addr = (u64)map.m_pblk << blkbits;
> > +             if (IS_DAX(inode))
> > +                     iomap->addr = ((u64)map.m_pblk << blkbits) +
> > +                                   (get_start_sect(iomap->bdev) * 512);
> > +             else
> > +                     iomap->addr = (u64)map.m_pblk << blkbits;
>
> I'm not a fan of returning a physical device sector address from an
> interface where ever other user/caller expects this address to be a
> logical block address into the block device. It creates a landmine
> in the iomap API that callers may not be aware of and that's going
> to cause bugs. We're trying really hard to keep special case hacks
> like this out of the iomap infrastructure, so on those grounds alone
> I'd suggest this is a dead end approach.
>
> Hence I think that if the dax device needs a physical offset from
> the start of the block device the filesystem sits on, it should be
> set up at dax device instantiation time and so the filesystem/bdev
> never needs to be queried again for this information.
>

Agree. In retrospect it was my laziness in the dax-device
implementation to expect the block-device to be available.

It looks like fs_dax_get_by_bdev() is an intercept point where a
dax_device could be dynamically created to represent the subset range
indicated by the block-device partition. That would open up more
cleanup opportunities.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-29  0:04             ` Dan Williams
@ 2019-08-29  9:32               ` Christoph Hellwig
  2019-12-16 18:10               ` Vivek Goyal
  1 sibling, 0 replies; 77+ messages in thread
From: Christoph Hellwig @ 2019-08-29  9:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Vivek Goyal, Christoph Hellwig, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Wed, Aug 28, 2019 at 05:04:11PM -0700, Dan Williams wrote:
> Agree. In retrospect it was my laziness in the dax-device
> implementation to expect the block-device to be available.
> 
> It looks like fs_dax_get_by_bdev() is an intercept point where a
> dax_device could be dynamically created to represent the subset range
> indicated by the block-device partition. That would open up more
> cleanup opportunities.

That seems like a decent short-term plan.  But in the long run I'd just let
dax call into the partition table parser directly, as we might want to
support partitions without first having to create a block device on top
of the dax device.  Same for the badblocks handling story.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-08-29  0:04             ` Dan Williams
  2019-08-29  9:32               ` Christoph Hellwig
@ 2019-12-16 18:10               ` Vivek Goyal
  2020-01-07 12:51                 ` Christoph Hellwig
  1 sibling, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2019-12-16 18:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Christoph Hellwig, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Wed, Aug 28, 2019 at 05:04:11PM -0700, Dan Williams wrote:
> On Wed, Aug 28, 2019 at 3:53 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> > > On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > > > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > > > For bdev_dax_pgoff
> > > > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > > > somehow so that we can get rid of the block device entirely.
> > > > >
> > > > > IIUC, there is one block_device per partition while there is only one
> > > > > dax_device for the whole disk. So we can't directly move bdev logical
> > > > > offset into dax_device.
> > > >
> > > > Well, then we need to find a way to get partitions for dax devices,
> > > > as we really should not expect a block device hiding behind a dax
> > > > dev.  That is just a weird legacy assumption - block device need to
> > > > layer on top of the dax device optionally.
> > > >
> > > > >
> > > > > We probably could put this in "iomap" and leave it to filesystems to
> > > > > report offset into dax_dev in iomap that way dax generic code does not
> > > > > have to deal with it. But that probably will be a bigger change.
> > > >
> > > > And where would the file system get that information from?
> > >
> > > File system knows about block device, can it just call get_start_sect()
> > > while filling iomap->addr. And this means we don't have to have
> > > parition information in dax device. Will something like following work?
> > > (Just a proof of concept patch).
> > >
> > >
> > > ---
> > >  drivers/dax/super.c |   11 +++++++++++
> > >  fs/dax.c            |    6 +++---
> > >  fs/ext4/inode.c     |    6 +++++-
> > >  include/linux/dax.h |    1 +
> > >  4 files changed, 20 insertions(+), 4 deletions(-)
> > >
> > > Index: rhvgoyal-linux/fs/ext4/inode.c
> > > ===================================================================
> > > --- rhvgoyal-linux.orig/fs/ext4/inode.c       2019-08-28 13:51:16.051937204 -0400
> > > +++ rhvgoyal-linux/fs/ext4/inode.c    2019-08-28 13:51:44.453937204 -0400
> > > @@ -3589,7 +3589,11 @@ retry:
> > >                       WARN_ON_ONCE(1);
> > >                       return -EIO;
> > >               }
> > > -             iomap->addr = (u64)map.m_pblk << blkbits;
> > > +             if (IS_DAX(inode))
> > > +                     iomap->addr = ((u64)map.m_pblk << blkbits) +
> > > +                                   (get_start_sect(iomap->bdev) * 512);
> > > +             else
> > > +                     iomap->addr = (u64)map.m_pblk << blkbits;
> >
> > I'm not a fan of returning a physical device sector address from an
> > interface where ever other user/caller expects this address to be a
> > logical block address into the block device. It creates a landmine
> > in the iomap API that callers may not be aware of and that's going
> > to cause bugs. We're trying really hard to keep special case hacks
> > like this out of the iomap infrastructure, so on those grounds alone
> > I'd suggest this is a dead end approach.
> >
> > Hence I think that if the dax device needs a physical offset from
> > the start of the block device the filesystem sits on, it should be
> > set up at dax device instantiation time and so the filesystem/bdev
> > never needs to be queried again for this information.
> >
> 
> Agree. In retrospect it was my laziness in the dax-device
> implementation to expect the block-device to be available.
> 
> It looks like fs_dax_get_by_bdev() is an intercept point where a
> dax_device could be dynamically created to represent the subset range
> indicated by the block-device partition. That would open up more
> cleanup opportunities.

Hi Dan,

After a long time I finally got time to look at this again. I want to work
on this cleanup so that I can make progress with the virtiofs DAX patches.

I am not sure I understand the requirements fully. I see that right now
a dax_device is created per device and all block partitions refer to it. If
we want to create one dax_device per partition, then it looks like this
will be structured more along the lines of how the block layer handles disks
and partitions (one gendisk for the disk and block_devices for partitions,
including partition 0). That probably means state belonging to the whole
device will be in a common structure, say dax_device_common, per-partition
state will be in dax_device, and dax_device can carry a pointer to
dax_device_common.
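
A rough sketch of that split, written as standalone userspace C so it can be
compiled on its own. Everything here is hypothetical: the field names, the
dax_device_init() helper and the choice of 512-byte sectors are assumptions
for illustration, and the toy struct dax_device only shares its name with
the real kernel structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: state shared by the whole device (compare gendisk). */
struct dax_device_common {
	const char *host;	/* device name */
	uint64_t nr_sectors;	/* total size in 512-byte sectors */
};

/* Hypothetical: per-partition state (compare block_device). */
struct dax_device {
	struct dax_device_common *common;	/* whole-device state */
	uint64_t start_sect;	/* partition start within the device */
	uint64_t nr_sectors;	/* partition length */
};

/*
 * Set up a per-partition view of the device. Returns 0 on success, or
 * -1 if the requested range does not fit inside the whole device.
 */
static int dax_device_init(struct dax_device *dd,
			   struct dax_device_common *common,
			   uint64_t start_sect, uint64_t nr_sectors)
{
	if (start_sect > common->nr_sectors ||
	    nr_sectors > common->nr_sectors - start_sect)
		return -1;

	dd->common = common;
	dd->start_sect = start_sect;
	dd->nr_sectors = nr_sectors;
	return 0;
}
```

In this model "partition 0" is simply a dax_device with start_sect 0 that
spans the whole of its dax_device_common.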

I am also not sure what it means to partition dax devices. How will
partitions be exported to user space?

Thanks
Vivek



* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2019-08-26 20:58       ` Vivek Goyal
  2019-08-26 21:33         ` Dan Williams
  2019-08-28  6:58         ` Christoph Hellwig
@ 2020-01-03 14:12         ` Vivek Goyal
  2020-01-03 18:12           ` Dan Williams
  2 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-03 14:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, virtio-fs, miklos,
	stefanha, dgilbert, Christoph Hellwig

On Mon, Aug 26, 2019 at 04:58:29PM -0400, Vivek Goyal wrote:
> On Mon, Aug 26, 2019 at 04:33:26PM -0400, Vivek Goyal wrote:
> > On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> > > On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > > > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > > > is searched from that bdev name.
> > > > 
> > > > virtio-fs does not have a bdev. So pass in dax_dev also to
> > > > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > > > used otherwise dax_dev is searched using bdev.
> > > 
> > > Please just pass in only the dax_device and get rid of the block device.
> > > The callers should have one at hand easily, e.g. for XFS just call
> > > xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.
> > 
> > Sure. Here is the updated patch.
> > 
> > This patch can probably go upstream independently. If you are fine with
> > the patch, I can post it separately for inclusion.
> 
> Forgot to update function declaration in case of !CONFIG_FS_DAX. Here is
> the updated patch.
> 
> Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
> 
> As of now dax_writeback_mapping_range() takes "struct block_device" as a
> parameter and dax_dev is searched from bdev name. This also involves taking
> a fresh reference on dax_dev and putting that reference at the end of
> function.
> 
> We are developing a new filesystem virtio-fs and using dax to access host
> page cache directly. But there is no block device. IOW, we want to make
> use of dax but want to get rid of this assumption that there is always
> a block device associated with dax_dev.
> 
> So pass in "struct dax_device" as parameter instead of bdev.
> 
> ext2/ext4/xfs are current users and they already have a reference on
> dax_device. So there is no need to take reference and drop reference to
> dax_device on each call of this function.
> 
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Hi Dan,

Ping for this patch. I see Christoph and Jan acked it. Can we take it? I am
not sure how to get an ack from the ext4 developers.

Thanks
Vivek

> ---
>  fs/dax.c            |    8 +-------
>  fs/ext2/inode.c     |    5 +++--
>  fs/ext4/inode.c     |    2 +-
>  fs/xfs/xfs_aops.c   |    2 +-
>  include/linux/dax.h |    4 ++--
>  5 files changed, 8 insertions(+), 13 deletions(-)
> 
> Index: rhvgoyal-linux-fuse/fs/dax.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/dax.c	2019-08-26 16:45:26.093710196 -0400
> +++ rhvgoyal-linux-fuse/fs/dax.c	2019-08-26 16:45:29.462710196 -0400
> @@ -936,12 +936,11 @@ static int dax_writeback_one(struct xa_s
>   * on persistent storage prior to completion of the operation.
>   */
>  int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct block_device *bdev, struct writeback_control *wbc)
> +		struct dax_device *dax_dev, struct writeback_control *wbc)
>  {
>  	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
>  	struct inode *inode = mapping->host;
>  	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
> -	struct dax_device *dax_dev;
>  	void *entry;
>  	int ret = 0;
>  	unsigned int scanned = 0;
> @@ -952,10 +951,6 @@ int dax_writeback_mapping_range(struct a
>  	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
>  		return 0;
>  
> -	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> -	if (!dax_dev)
> -		return -EIO;
> -
>  	trace_dax_writeback_range(inode, xas.xa_index, end_index);
>  
>  	tag_pages_for_writeback(mapping, xas.xa_index, end_index);
> @@ -976,7 +971,6 @@ int dax_writeback_mapping_range(struct a
>  		xas_lock_irq(&xas);
>  	}
>  	xas_unlock_irq(&xas);
> -	put_dax(dax_dev);
>  	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
>  	return ret;
>  }
> Index: rhvgoyal-linux-fuse/include/linux/dax.h
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/include/linux/dax.h	2019-08-26 16:45:26.094710196 -0400
> +++ rhvgoyal-linux-fuse/include/linux/dax.h	2019-08-26 16:46:08.101710196 -0400
> @@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
>  
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct block_device *bdev, struct writeback_control *wbc);
> +		struct dax_device *dax_dev, struct writeback_control *wbc);
>  
>  struct page *dax_layout_busy_page(struct address_space *mapping);
>  dax_entry_t dax_lock_page(struct page *page);
> @@ -180,7 +180,7 @@ static inline struct page *dax_layout_bu
>  }
>  
>  static inline int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct block_device *bdev, struct writeback_control *wbc)
> +		struct dax_device *dax_dev, struct writeback_control *wbc)
>  {
>  	return -EOPNOTSUPP;
>  }
> Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/xfs/xfs_aops.c	2019-08-26 16:45:26.094710196 -0400
> +++ rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c	2019-08-26 16:45:29.471710196 -0400
> @@ -1120,7 +1120,7 @@ xfs_dax_writepages(
>  {
>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>  	return dax_writeback_mapping_range(mapping,
> -			xfs_find_bdev_for_inode(mapping->host), wbc);
> +			xfs_find_daxdev_for_inode(mapping->host), wbc);
>  }
>  
>  STATIC int
> Index: rhvgoyal-linux-fuse/fs/ext4/inode.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/ext4/inode.c	2019-08-26 16:45:26.093710196 -0400
> +++ rhvgoyal-linux-fuse/fs/ext4/inode.c	2019-08-26 16:45:29.475710196 -0400
> @@ -2992,7 +2992,7 @@ static int ext4_dax_writepages(struct ad
>  	percpu_down_read(&sbi->s_journal_flag_rwsem);
>  	trace_ext4_writepages(inode, wbc);
>  
> -	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
> +	ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>  	trace_ext4_writepages_result(inode, wbc, ret,
>  				     nr_to_write - wbc->nr_to_write);
>  	percpu_up_read(&sbi->s_journal_flag_rwsem);
> Index: rhvgoyal-linux-fuse/fs/ext2/inode.c
> ===================================================================
> --- rhvgoyal-linux-fuse.orig/fs/ext2/inode.c	2019-08-26 16:45:26.093710196 -0400
> +++ rhvgoyal-linux-fuse/fs/ext2/inode.c	2019-08-26 16:45:29.477710196 -0400
> @@ -957,8 +957,9 @@ ext2_writepages(struct address_space *ma
>  static int
>  ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
> -	return dax_writeback_mapping_range(mapping,
> -			mapping->host->i_sb->s_bdev, wbc);
> +	struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
> +
> +	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>  }
>  
>  const struct address_space_operations ext2_aops = {



* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2020-01-03 14:12         ` Vivek Goyal
@ 2020-01-03 18:12           ` Dan Williams
  2020-01-03 18:18             ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-03 18:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Linux Kernel Mailing List, linux-nvdimm,
	virtio-fs, Miklos Szeredi, Stefan Hajnoczi,
	Dr. David Alan Gilbert, Christoph Hellwig

On Fri, Jan 3, 2020 at 6:12 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Aug 26, 2019 at 04:58:29PM -0400, Vivek Goyal wrote:
> > On Mon, Aug 26, 2019 at 04:33:26PM -0400, Vivek Goyal wrote:
> > > On Mon, Aug 26, 2019 at 04:53:16AM -0700, Christoph Hellwig wrote:
> > > > On Wed, Aug 21, 2019 at 01:57:03PM -0400, Vivek Goyal wrote:
> > > > > Right now dax_writeback_mapping_range() is passed a bdev and dax_dev
> > > > > is searched from that bdev name.
> > > > >
> > > > > virtio-fs does not have a bdev. So pass in dax_dev also to
> > > > > dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
> > > > > used otherwise dax_dev is searched using bdev.
> > > >
> > > > Please just pass in only the dax_device and get rid of the block device.
> > > > The callers should have one at hand easily, e.g. for XFS just call
> > > > xfs_find_daxdev_for_inode instead of xfs_find_bdev_for_inode.
> > >
> > > Sure. Here is the updated patch.
> > >
> > > This patch can probably go upstream independently. If you are fine with
> > > the patch, I can post it separately for inclusion.
> >
> > Forgot to update function declaration in case of !CONFIG_FS_DAX. Here is
> > the updated patch.
> >
> > Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
> >
> > As of now dax_writeback_mapping_range() takes "struct block_device" as a
> > parameter and dax_dev is searched from bdev name. This also involves taking
> > a fresh reference on dax_dev and putting that reference at the end of
> > function.
> >
> > We are developing a new filesystem virtio-fs and using dax to access host
> > page cache directly. But there is no block device. IOW, we want to make
> > use of dax but want to get rid of this assumption that there is always
> > a block device associated with dax_dev.
> >
> > So pass in "struct dax_device" as parameter instead of bdev.
> >
> > ext2/ext4/xfs are current users and they already have a reference on
> > dax_device. So there is no need to take reference and drop reference to
> > dax_device on each call of this function.
> >
> > Suggested-by: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> Hi Dan,
>
> Ping for this patch. I see christoph and Jan acked it. Can we take it. Not
> sure how to get ack from ext4 developers.

Jan counts for ext4, I just missed this. Now merged.


* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2020-01-03 18:12           ` Dan Williams
@ 2020-01-03 18:18             ` Dan Williams
  2020-01-03 18:33               ` Vivek Goyal
  2020-01-03 18:43               ` Vivek Goyal
  0 siblings, 2 replies; 77+ messages in thread
From: Dan Williams @ 2020-01-03 18:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Linux Kernel Mailing List, linux-nvdimm,
	virtio-fs, Miklos Szeredi, Stefan Hajnoczi,
	Dr. David Alan Gilbert, Christoph Hellwig

On Fri, Jan 3, 2020 at 10:12 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Jan 3, 2020 at 6:12 AM Vivek Goyal <vgoyal@redhat.com> wrote:
[..]
> > Hi Dan,
> >
> > Ping for this patch. I see christoph and Jan acked it. Can we take it. Not
> > sure how to get ack from ext4 developers.
>
> Jan counts for ext4, I just missed this. Now merged.

Oh, this now collides with:

   30fa529e3b2e xfs: add a xfs_inode_buftarg helper

Care to rebase? I'll also circle back to your question about
partitions on patch1.


* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2020-01-03 18:18             ` Dan Williams
@ 2020-01-03 18:33               ` Vivek Goyal
  2020-01-03 19:30                 ` Dan Williams
  2020-01-03 18:43               ` Vivek Goyal
  1 sibling, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-03 18:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, Linux Kernel Mailing List, linux-nvdimm,
	virtio-fs, Miklos Szeredi, Stefan Hajnoczi,
	Dr. David Alan Gilbert, Christoph Hellwig

On Fri, Jan 03, 2020 at 10:18:22AM -0800, Dan Williams wrote:
> On Fri, Jan 3, 2020 at 10:12 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Jan 3, 2020 at 6:12 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> [..]
> > > Hi Dan,
> > >
> > > Ping for this patch. I see christoph and Jan acked it. Can we take it. Not
> > > sure how to get ack from ext4 developers.
> >
> > Jan counts for ext4, I just missed this. Now merged.
> 
> Oh, this now collides with:
> 
>    30fa529e3b2e xfs: add a xfs_inode_buftarg helper
> 
> Care to rebase? I'll also circle back to your question about
> partitions on patch1.

Hi Dan,

Here is the updated patch.

Thanks
Vivek

Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()

As of now dax_writeback_mapping_range() takes "struct block_device" as a
parameter and dax_dev is searched from bdev name. This also involves taking
a fresh reference on dax_dev and putting that reference at the end of
function.

We are developing a new filesystem virtio-fs and using dax to access host
page cache directly. But there is no block device. IOW, we want to make
use of dax but want to get rid of this assumption that there is always
a block device associated with dax_dev.

So pass in "struct dax_device" as parameter instead of bdev.

ext2/ext4/xfs are current users and they already have a reference on
dax_device. So there is no need to take reference and drop reference to
dax_device on each call of this function.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            |    8 +-------
 fs/ext2/inode.c     |    5 +++--
 fs/ext4/inode.c     |    2 +-
 fs/xfs/xfs_aops.c   |    2 +-
 include/linux/dax.h |    4 ++--
 5 files changed, 8 insertions(+), 13 deletions(-)

Index: rhvgoyal-linux-fuse/fs/dax.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/dax.c	2020-01-03 11:19:59.151186062 -0500
+++ rhvgoyal-linux-fuse/fs/dax.c	2020-01-03 11:20:05.602186062 -0500
@@ -937,12 +937,11 @@ static int dax_writeback_one(struct xa_s
  * on persistent storage prior to completion of the operation.
  */
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
 	XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
 	struct inode *inode = mapping->host;
 	pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-	struct dax_device *dax_dev;
 	void *entry;
 	int ret = 0;
 	unsigned int scanned = 0;
@@ -953,10 +952,6 @@ int dax_writeback_mapping_range(struct a
 	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
 		return 0;
 
-	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-	if (!dax_dev)
-		return -EIO;
-
 	trace_dax_writeback_range(inode, xas.xa_index, end_index);
 
 	tag_pages_for_writeback(mapping, xas.xa_index, end_index);
@@ -977,7 +972,6 @@ int dax_writeback_mapping_range(struct a
 		xas_lock_irq(&xas);
 	}
 	xas_unlock_irq(&xas);
-	put_dax(dax_dev);
 	trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
 	return ret;
 }
Index: rhvgoyal-linux-fuse/include/linux/dax.h
===================================================================
--- rhvgoyal-linux-fuse.orig/include/linux/dax.h	2020-01-03 11:19:59.151186062 -0500
+++ rhvgoyal-linux-fuse/include/linux/dax.h	2020-01-03 11:20:05.603186062 -0500
@@ -141,7 +141,7 @@ static inline void fs_put_dax(struct dax
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
+		struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
@@ -180,7 +180,7 @@ static inline struct page *dax_layout_bu
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc)
+		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
 	return -EOPNOTSUPP;
 }
Index: rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/xfs/xfs_aops.c	2020-01-03 11:19:59.151186062 -0500
+++ rhvgoyal-linux-fuse/fs/xfs/xfs_aops.c	2020-01-03 11:20:05.605186062 -0500
@@ -587,7 +587,7 @@ xfs_dax_writepages(
 
 	xfs_iflags_clear(ip, XFS_ITRUNCATED);
 	return dax_writeback_mapping_range(mapping,
-			xfs_inode_buftarg(ip)->bt_bdev, wbc);
+			xfs_inode_buftarg(ip)->bt_daxdev, wbc);
 }
 
 STATIC sector_t
Index: rhvgoyal-linux-fuse/fs/ext4/inode.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/ext4/inode.c	2020-01-03 11:19:59.151186062 -0500
+++ rhvgoyal-linux-fuse/fs/ext4/inode.c	2020-01-03 11:20:05.606186062 -0500
@@ -2866,7 +2866,7 @@ static int ext4_dax_writepages(struct ad
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	ret = dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 	trace_ext4_writepages_result(inode, wbc, ret,
 				     nr_to_write - wbc->nr_to_write);
 	percpu_up_read(&sbi->s_journal_flag_rwsem);
Index: rhvgoyal-linux-fuse/fs/ext2/inode.c
===================================================================
--- rhvgoyal-linux-fuse.orig/fs/ext2/inode.c	2020-01-03 11:19:59.151186062 -0500
+++ rhvgoyal-linux-fuse/fs/ext2/inode.c	2020-01-03 11:20:05.608186062 -0500
@@ -960,8 +960,9 @@ ext2_writepages(struct address_space *ma
 static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-	return dax_writeback_mapping_range(mapping,
-			mapping->host->i_sb->s_bdev, wbc);
+	struct ext2_sb_info *sbi = EXT2_SB(mapping->host->i_sb);
+
+	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 }
 
 const struct address_space_operations ext2_aops = {



* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2020-01-03 18:18             ` Dan Williams
  2020-01-03 18:33               ` Vivek Goyal
@ 2020-01-03 18:43               ` Vivek Goyal
  1 sibling, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2020-01-03 18:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, Linux Kernel Mailing List, linux-nvdimm,
	virtio-fs, Miklos Szeredi, Stefan Hajnoczi,
	Dr. David Alan Gilbert, Christoph Hellwig

On Fri, Jan 03, 2020 at 10:18:22AM -0800, Dan Williams wrote:

> I'll also circle back to your question about
> partitions on patch1.

Hi Dan,

I was playing with having sector information in dax device (and not having
to look back at bdev). I was thinking of something as follows.

- Create a new structure/handle which also contains the offset into the dax
  device in sectors, say:

  struct dax_handle {
  	sector_t start_sect;
  	struct dax_device *dax_dev;
  };

  This handle will have a pointer to the actual dax device.

- Modify dax_get_by_bdev(struct block_device *bdev) to return dax_handle
  (instead of dax device).

  struct dax_handle *dax_get_by_bdev(struct block_device *bdev);

  This will create dax_handle. Find dax_device from hash table and
  initialize dax_handle.

  dax_handle->start_sect = get_start_sect(bdev);
  dax_handle->dax_dev = dax_dev;

  Now filesystem and stacked block devices can get pointer to dax_handle
  using block device and they can use this handle to refer to underlying
  dax device partition.

- Now dax_handle can be passed around and hopefully we can get rid of
  passing around bdev in many of the dax interfaces. And partition offset
  information has now moved into dax_handle.

- For the use cases which don't have a bdev (like virtiofs), we can
  provide another helper to get dax_handle with offset 0. And then
  we should not need a bdev to be able to use dax API.

Does this sound like a reasonable step in the direction of getting rid
of this assumption that every dax_device has an associated block_device?
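
The steps above can be sketched as standalone userspace C. This is only an
illustration under stated assumptions: dax_handle_init() stands in for the
proposed dax_get_by_bdev() (the caller passes in what would be looked up
from the bdev), dax_handle_pgoff() is a made-up helper loosely modeled on
what bdev_dax_pgoff() does today, and 512-byte sectors and 4096-byte pages
are assumed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE	512u	/* assumed sector size */
#define DAX_PAGE_SIZE	4096u	/* assumed page size */

struct dax_device;		/* opaque in this sketch */

struct dax_handle {
	uint64_t start_sect;	/* partition offset, in sectors */
	struct dax_device *dax_dev;
};

/*
 * Hypothetical stand-in for the proposed dax_get_by_bdev(): a start_sect
 * of 0 covers the "no bdev" case such as virtiofs.
 */
static struct dax_handle dax_handle_init(struct dax_device *dax_dev,
					 uint64_t start_sect)
{
	struct dax_handle h = {
		.start_sect = start_sect,
		.dax_dev = dax_dev,
	};

	return h;
}

/*
 * Translate a handle-relative sector into a page offset into the dax
 * device. Returns 0 and fills *pgoff, or -EINVAL if the resulting byte
 * offset is not page aligned.
 */
static int dax_handle_pgoff(const struct dax_handle *h, uint64_t sector,
			    uint64_t *pgoff)
{
	uint64_t phys_off = (h->start_sect + sector) * SECTOR_SIZE;

	if (phys_off % DAX_PAGE_SIZE)
		return -EINVAL;

	*pgoff = phys_off / DAX_PAGE_SIZE;
	return 0;
}
```

With this shape, callers only ever see the handle: the partition offset is
captured once when the handle is set up, and no block_device is needed
afterwards to compute a page offset.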

Thanks
Vivek



* Re: [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range()
  2020-01-03 18:33               ` Vivek Goyal
@ 2020-01-03 19:30                 ` Dan Williams
  0 siblings, 0 replies; 77+ messages in thread
From: Dan Williams @ 2020-01-03 19:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Linux Kernel Mailing List, linux-nvdimm,
	virtio-fs, Miklos Szeredi, Stefan Hajnoczi,
	Dr. David Alan Gilbert, Christoph Hellwig

On Fri, Jan 3, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jan 03, 2020 at 10:18:22AM -0800, Dan Williams wrote:
> > On Fri, Jan 3, 2020 at 10:12 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Fri, Jan 3, 2020 at 6:12 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > [..]
> > > > Hi Dan,
> > > >
> > > > Ping for this patch. I see christoph and Jan acked it. Can we take it. Not
> > > > sure how to get ack from ext4 developers.
> > >
> > > Jan counts for ext4, I just missed this. Now merged.
> >
> > Oh, this now collides with:
> >
> >    30fa529e3b2e xfs: add a xfs_inode_buftarg helper
> >
> > Care to rebase? I'll also circle back to your question about
> > partitions on patch1.
>
> Hi Dan,
>
> Here is the updated patch.
>
> Thanks
> Vivek
>
> Subject: dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()

Looks good, applied.

* Re: [PATCH 01/19] dax: remove block device dependencies
  2019-12-16 18:10               ` Vivek Goyal
@ 2020-01-07 12:51                 ` Christoph Hellwig
  2020-01-07 14:22                   ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Hellwig @ 2020-01-07 12:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dan Williams, Dave Chinner, Christoph Hellwig, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > Agree. In retrospect it was my laziness in the dax-device
> > implementation to expect the block-device to be available.
> > 
> > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > dax_device could be dynamically created to represent the subset range
> > indicated by the block-device partition. That would open up more
> > cleanup opportunities.
> 
> Hi Dan,
> 
> After a long time I got time to look at it again. Want to work on this
> cleanup so that I can make progress with virtiofs DAX patches.
> 
> I am not sure I understand the requirements fully. I see that right now
> dax_device is created per device and all block partitions refer to it. If
> we want to create one dax_device per partition, then it looks like this
> will be structured more along the lines how block layer handles disk and
> partitions. (One gendisk for disk and block_devices for partitions,
> including partition 0). That probably means state belong to whole device
> will be in common structure say dax_device_common, and per partition state
> will be in dax_device and dax_device can carry a pointer to
> dax_device_common.
> 
> I am also not sure what does it mean to partition dax devices. How will
> partitions be exported to user space.

Dan, last time we talked you agreed that partitioned dax devices are
rather pointless IIRC.  Should we just deprecate partitions on DAX
devices and then remove them after a cycle or two?

* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 12:51                 ` Christoph Hellwig
@ 2020-01-07 14:22                   ` Dan Williams
  2020-01-07 17:07                     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-07 14:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vivek Goyal, Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > Agree. In retrospect it was my laziness in the dax-device
> > > implementation to expect the block-device to be available.
> > >
> > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > dax_device could be dynamically created to represent the subset range
> > > indicated by the block-device partition. That would open up more
> > > cleanup opportunities.
> >
> > Hi Dan,
> >
> > After a long time I got time to look at it again. Want to work on this
> > cleanup so that I can make progress with virtiofs DAX patches.
> >
> > I am not sure I understand the requirements fully. I see that right now
> > dax_device is created per device and all block partitions refer to it. If
> > we want to create one dax_device per partition, then it looks like this
> > will be structured more along the lines how block layer handles disk and
> > partitions. (One gendisk for disk and block_devices for partitions,
> > including partition 0). That probably means state belong to whole device
> > will be in common structure say dax_device_common, and per partition state
> > will be in dax_device and dax_device can carry a pointer to
> > dax_device_common.
> >
> > I am also not sure what does it mean to partition dax devices. How will
> > partitions be exported to user space.
>
> Dan, last time we talked you agreed that partitioned dax devices are
> rather pointless IIRC.  Should we just deprecate partitions on DAX
> devices and then remove them after a cycle or two?

That does seem a better plan than trying to force partition support
where it is not needed.

* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 14:22                   ` Dan Williams
@ 2020-01-07 17:07                     ` Darrick J. Wong
  2020-01-07 17:29                       ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2020-01-07 17:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Vivek Goyal, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > Agree. In retrospect it was my laziness in the dax-device
> > > > implementation to expect the block-device to be available.
> > > >
> > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > dax_device could be dynamically created to represent the subset range
> > > > indicated by the block-device partition. That would open up more
> > > > cleanup opportunities.
> > >
> > > Hi Dan,
> > >
> > > After a long time I got time to look at it again. Want to work on this
> > > cleanup so that I can make progress with virtiofs DAX patches.
> > >
> > > I am not sure I understand the requirements fully. I see that right now
> > > dax_device is created per device and all block partitions refer to it. If
> > > we want to create one dax_device per partition, then it looks like this
> > > will be structured more along the lines how block layer handles disk and
> > > partitions. (One gendisk for disk and block_devices for partitions,
> > > including partition 0). That probably means state belong to whole device
> > > will be in common structure say dax_device_common, and per partition state
> > > will be in dax_device and dax_device can carry a pointer to
> > > dax_device_common.
> > >
> > > I am also not sure what does it mean to partition dax devices. How will
> > > partitions be exported to user space.
> >
> > Dan, last time we talked you agreed that partitioned dax devices are
> > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > devices and then remove them after a cycle or two?
> 
> That does seem a better plan than trying to force partition support
> where it is not needed.

Question: if one /did/ have a partitioned DAX device and used kpartx to
create dm-linear devices for each partition, will DAX still work through
that?

--D

* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 17:07                     ` Darrick J. Wong
@ 2020-01-07 17:29                       ` Dan Williams
  2020-01-07 18:01                         ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-07 17:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Vivek Goyal, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > implementation to expect the block-device to be available.
> > > > >
> > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > dax_device could be dynamically created to represent the subset range
> > > > > indicated by the block-device partition. That would open up more
> > > > > cleanup opportunities.
> > > >
> > > > Hi Dan,
> > > >
> > > > After a long time I got time to look at it again. Want to work on this
> > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > >
> > > > I am not sure I understand the requirements fully. I see that right now
> > > > dax_device is created per device and all block partitions refer to it. If
> > > > we want to create one dax_device per partition, then it looks like this
> > > > will be structured more along the lines how block layer handles disk and
> > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > including partition 0). That probably means state belong to whole device
> > > > will be in common structure say dax_device_common, and per partition state
> > > > will be in dax_device and dax_device can carry a pointer to
> > > > dax_device_common.
> > > >
> > > > I am also not sure what does it mean to partition dax devices. How will
> > > > partitions be exported to user space.
> > >
> > > Dan, last time we talked you agreed that partitioned dax devices are
> > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > devices and then remove them after a cycle or two?
> >
> > That does seem a better plan than trying to force partition support
> > where it is not needed.
>
> Question: if one /did/ have a partitioned DAX device and used kpartx to
> create dm-linear devices for each partition, will DAX still work through
> that?

The device-mapper support will continue, but it will be limited to
whole device sub-components. I.e. you could use kpartx to carve up
/dev/pmem0 and still have dax, but not partitions of /dev/pmem0.

* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 17:29                       ` Dan Williams
@ 2020-01-07 18:01                         ` Vivek Goyal
  2020-01-07 18:07                           ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-07 18:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > implementation to expect the block-device to be available.
> > > > > >
> > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > indicated by the block-device partition. That would open up more
> > > > > > cleanup opportunities.
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > After a long time I got time to look at it again. Want to work on this
> > > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > > >
> > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > we want to create one dax_device per partition, then it looks like this
> > > > > will be structured more along the lines how block layer handles disk and
> > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > including partition 0). That probably means state belong to whole device
> > > > > will be in common structure say dax_device_common, and per partition state
> > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > dax_device_common.
> > > > >
> > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > partitions be exported to user space.
> > > >
> > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > devices and then remove them after a cycle or two?
> > >
> > > That does seem a better plan than trying to force partition support
> > > where it is not needed.
> >
> > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > create dm-linear devices for each partition, will DAX still work through
> > that?
> 
> The device-mapper support will continue, but it will be limited to
> whole device sub-components. I.e. you could use kpartx to carve up
> /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.

So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
is a block device, I thought tools will expect it to be partitioned.
Sometimes I create those partitions and use /dev/pmem0. So what's
the replacement for this? People often have tools/scripts which might
want to partition the device and these will start failing.

IOW, I do not understand why being able to partition /dev/pmem0
(which is a block device from user space point of view) is pointless.

Thanks
Vivek


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 18:01                         ` Vivek Goyal
@ 2020-01-07 18:07                           ` Dan Williams
  2020-01-07 18:33                             ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-07 18:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > implementation to expect the block-device to be available.
> > > > > > >
> > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > cleanup opportunities.
> > > > > >
> > > > > > Hi Dan,
> > > > > >
> > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > > > >
> > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > including partition 0). That probably means state belong to whole device
> > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > dax_device_common.
> > > > > >
> > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > partitions be exported to user space.
> > > > >
> > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > devices and then remove them after a cycle or two?
> > > >
> > > > That does seem a better plan than trying to force partition support
> > > > where it is not needed.
> > >
> > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > create dm-linear devices for each partition, will DAX still work through
> > > that?
> >
> > The device-mapper support will continue, but it will be limited to
> > whole device sub-components. I.e. you could use kpartx to carve up
> > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
>
> So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> is a block device, I thought tools will expect it to be partitioned.
> Sometimes I create those partitions and use /dev/pmem0. So what's
> the replacement for this. People often have tools/scripts which might
> want to partition the device and these will start failing.

Partitioning will still work, but dax operation will be declined and
fall back to page-cache.

> IOW, I do not understand that why being able to partition /dev/pmem0
> (which is a block device from user space point of view), is pointless.

How about s/pointless/redundant/. Persistent memory can already be
"partitioned" via namespace boundaries. Block device partitioning is
then redundant and needlessly complicates, as you have found, the
kernel implementation.

The problem will be people that were on dax+ext4 on partitions. Those
people will see a hard failure at mount whereas XFS will fall back to
page cache with a warning in the log. I think ext4 must convert to the
xfs dax handling model before partition support is dropped.

* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 18:07                           ` Dan Williams
@ 2020-01-07 18:33                             ` Vivek Goyal
  2020-01-07 18:49                               ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-07 18:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > >
> > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > implementation to expect the block-device to be available.
> > > > > > > >
> > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > > cleanup opportunities.
> > > > > > >
> > > > > > > Hi Dan,
> > > > > > >
> > > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > > > > >
> > > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > > including partition 0). That probably means state belong to whole device
> > > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > dax_device_common.
> > > > > > >
> > > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > > partitions be exported to user space.
> > > > > >
> > > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > > devices and then remove them after a cycle or two?
> > > > >
> > > > > That does seem a better plan than trying to force partition support
> > > > > where it is not needed.
> > > >
> > > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > > create dm-linear devices for each partition, will DAX still work through
> > > > that?
> > >
> > > The device-mapper support will continue, but it will be limited to
> > > whole device sub-components. I.e. you could use kpartx to carve up
> > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
> >
> > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> > is a block device, I thought tools will expect it to be partitioned.
> > Sometimes I create those partitions and use /dev/pmem0. So what's
> > the replacement for this. People often have tools/scripts which might
> > want to partition the device and these will start failing.
> 
> Partitioning will still work, but dax operation will be declined and
> fall back to page-cache.

Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or
filesystem will fall back to using page cache. (But dax will not be
enabled).

> 
> > IOW, I do not understand that why being able to partition /dev/pmem0
> > (which is a block device from user space point of view), is pointless.
> 
> How about s/pointless/redundant/. Persistent memory can already be
> "partitioned" via namespace boundaries.

But that's an entirely different way of partitioning. To me being able
to use block devices (with dax capability) in the same way as any other
block device makes sense.

> Block device partitioning is
> then redundant and needlessly complicates, as you have found, the
> kernel implementation.

It does complicate the kernel implementation. Is it too hard to solve the
problem in the kernel?

W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
dax code refers back to block device to figure out partition offset in
dax device. If we create a dax object corresponding to "struct block_device"
and store sector offset in that, then we could pass that object to dax
code and not worry about referring back to bdev. I have written some
proof of concept code and called that object "dax_handle". I can post
that code if there is interest.

IMHO, it feels useful to be able to partition and use a dax capable
block device in the same way as a non-dax block device. It will be really
odd to think that if the filesystem is on /dev/pmem0p1, then dax can't
be enabled but if the filesystem is on /dev/mapper/pmem0p1, then dax
will work.

Thanks
Vivek

> 
> The problem will be people that were on dax+ext4 on partitions. Those
> people will see a hard failure at mount whereas XFS will fallback to
> page cache with a warning in the log. I think ext4 must convert to the
> xfs dax handling model before partition support is dropped.
> 


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 18:33                             ` Vivek Goyal
@ 2020-01-07 18:49                               ` Dan Williams
  2020-01-07 19:02                                 ` Darrick J. Wong
  2020-01-09 11:24                                 ` Jan Kara
  0 siblings, 2 replies; 77+ messages in thread
From: Dan Williams @ 2020-01-07 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > >
> > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > > implementation to expect the block-device to be available.
> > > > > > > > >
> > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > > > cleanup opportunities.
> > > > > > > >
> > > > > > > > Hi Dan,
> > > > > > > >
> > > > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > > > > > >
> > > > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > > > including partition 0). That probably means state belong to whole device
> > > > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > > dax_device_common.
> > > > > > > >
> > > > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > > > partitions be exported to user space.
> > > > > > >
> > > > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > > > devices and then remove them after a cycle or two?
> > > > > >
> > > > > > That does seem a better plan than trying to force partition support
> > > > > > where it is not needed.
> > > > >
> > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > > > create dm-linear devices for each partition, will DAX still work through
> > > > > that?
> > > >
> > > > The device-mapper support will continue, but it will be limited to
> > > > whole device sub-components. I.e. you could use kpartx to carve up
> > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
> > >
> > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> > > is a block device, I thought tools will expect it to be partitioned.
> > > Sometimes I create those partitions and use /dev/pmem0. So what's
> > > the replacement for this. People often have tools/scripts which might
> > > want to partition the device and these will start failing.
> >
> > Partitioning will still work, but dax operation will be declined and
> > fall back to page-cache.
>
> Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or
> filesystem will fall back to using page cache. (But dax will not be
> enabled).
>
> >
> > > IOW, I do not understand that why being able to partition /dev/pmem0
> > > (which is a block device from user space point of view), is pointless.
> >
> > How about s/pointless/redundant/. Persistent memory can already be
> > "partitioned" via namespace boundaries.
>
> But that's an entirely different way of partitioning. To me being able
> to use block devices (with dax capability) in same way as any other
> block device makes sense.
>
> > Block device partitioning is
> > then redundant and needlessly complicates, as you have found, the
> > kernel implementation.
>
> It does complicate kernel implementation. Is it too hard to solve the
> problem in kernel.
>
> W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> dax code refers back to block device to figure out partition offset in
> dax device. If we create a dax object corresponding to "struct block_device"
> and store sector offset in that, then we could pass that object to dax
> code and not worry about referring back to bdev. I have written some
> proof of concept code and called that object "dax_handle". I can post
> that code if there is interest.

I don't think it's worth it in the end especially considering
filesystems are looking to operate on /dev/dax devices directly and
remove block entanglements entirely.

> IMHO, it feels useful to be able to partition and use a dax capable
> block device in same way as non-dax block device. It will be really
> odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> will work.

That can already happen today. If you do not properly align the
partition then dax operations will be disabled. This proposal just
extends that existing failure domain to make all partitions fail to
support dax.
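
To make the alignment condition concrete, here is a minimal sketch,
assuming 512-byte sectors and 4 KiB pages. The function name
partition_dax_aligned() is hypothetical and the real kernel checks are more
involved; this only shows why an old-style partition starting at sector 63
cannot be used for dax while one starting at sector 2048 can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE	512u	/* assumption: 512-byte sectors */
#define SKETCH_PAGE_SIZE	4096u	/* assumption: 4 KiB pages */

/*
 * Hypothetical check: a partition is usable for dax only if both its
 * start and its length fall on page boundaries.
 */
bool partition_dax_aligned(uint64_t start_sect, uint64_t nr_sects)
{
	uint64_t start_bytes = start_sect * SKETCH_SECTOR_SIZE;
	uint64_t len_bytes = nr_sects * SKETCH_SECTOR_SIZE;

	return start_bytes % SKETCH_PAGE_SIZE == 0 &&
	       len_bytes % SKETCH_PAGE_SIZE == 0;
}
```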

* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 18:49                               ` Dan Williams
@ 2020-01-07 19:02                                 ` Darrick J. Wong
  2020-01-07 19:46                                   ` Dan Williams
  2020-01-09 11:24                                 ` Jan Kara
  1 sibling, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2020-01-07 19:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vivek Goyal, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 07, 2020 at 10:49:55AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > > >
> > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > > > implementation to expect the block-device to be available.
> > > > > > > > > >
> > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > > > > > > > dax_device could be dynamically created to represent the subset range
> > > > > > > > > > indicated by the block-device partition. That would open up more
> > > > > > > > > > cleanup opportunities.
> > > > > > > > >
> > > > > > > > > Hi Dan,
> > > > > > > > >
> > > > > > > > > After a long time I got time to look at it again. Want to work on this
> > > > > > > > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > > > > > > >
> > > > > > > > > I am not sure I understand the requirements fully. I see that right now
> > > > > > > > > dax_device is created per device and all block partitions refer to it. If
> > > > > > > > > we want to create one dax_device per partition, then it looks like this
> > > > > > > > > will be structured more along the lines how block layer handles disk and
> > > > > > > > > partitions. (One gendisk for disk and block_devices for partitions,
> > > > > > > > > including partition 0). That probably means state belong to whole device
> > > > > > > > > will be in common structure say dax_device_common, and per partition state
> > > > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > > > dax_device_common.
> > > > > > > > >
> > > > > > > > > I am also not sure what does it mean to partition dax devices. How will
> > > > > > > > > partitions be exported to user space.
> > > > > > > >
> > > > > > > > Dan, last time we talked you agreed that partitioned dax devices are
> > > > > > > > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > > > > > > > devices and then remove them after a cycle or two?
> > > > > > >
> > > > > > > That does seem a better plan than trying to force partition support
> > > > > > > where it is not needed.
> > > > > >
> > > > > > Question: if one /did/ have a partitioned DAX device and used kpartx to
> > > > > > create dm-linear devices for each partition, will DAX still work through
> > > > > > that?
> > > > >
> > > > > The device-mapper support will continue, but it will be limited to
> > > > > whole device sub-components. I.e. you could use kpartx to carve up
> > > > > /dev/pmem0 and still have dax, but not partitions of /dev/pmem0.
> > > >
> > > > So we can't use fdisk/parted to partition /dev/pmem0. Given /dev/pmem0
> > > > is a block device, I thought tools will expect it to be partitioned.
> > > > Sometimes I create those partitions and use /dev/pmem0. So what's
> > > > the replacement for this. People often have tools/scripts which might
> > > > want to partition the device and these will start failing.
> > >
> > > Partitioning will still work, but dax operation will be declined and
> > > fall back to page-cache.
> >
> > Ok, so if I mount /dev/pmem0p1 with dax enabled, that might fail or
> > filesystem will fall back to using page cache. (But dax will not be
> > enabled).
> >
> > >
> > > > IOW, I do not understand that why being able to partition /dev/pmem0
> > > > (which is a block device from user space point of view), is pointless.
> > >
> > > How about s/pointless/redundant/. Persistent memory can already be
> > > "partitioned" via namespace boundaries.
> >
> > But that's an entirely different way of partitioning. To me being able
> > to use block devices (with dax capability) in same way as any other
> > block device makes sense.
> >
> > > Block device partitioning is
> > > then redundant and needlessly complicates, as you have found, the
> > > kernel implementation.
> >
> > It does complicate kernel implementation. Is it too hard to solve the
> > problem in kernel.
> >
> > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > dax code refers back to block device to figure out partition offset in
> > dax device. If we create a dax object corresponding to "struct block_device"
> > and store sector offset in that, then we could pass that object to dax
> > code and not worry about referring back to bdev. I have written some
> > proof of concept code and called that object "dax_handle". I can post
> > that code if there is interest.
> 
> I don't think it's worth it in the end especially considering
> filesystems are looking to operate on /dev/dax devices directly and
> remove block entanglements entirely.
> 
> > IMHO, it feels useful to be able to partition and use a dax capable
> > block device in same way as non-dax block device. It will be really
> > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > will work.
> 
> That can already happen today. If you do not properly align the
> partition then dax operations will be disabled.

Er... is this conversation getting confused?  I was talking about
kpartx's /dev/mapper/pmem0p1 being a straight replacement for the kernel
creating /dev/pmem0p1.  I think Vivek was complaining about the
inconsistent behavior between the two, even if the partition is aligned
properly.

I'm not sure how alignment leaked in here?

> This proposal just
> extends that existing failure domain to make all partitions fail to
> support dax.

Oh, wait.  You're proposing that "partitions of pmem devices don't
support DAX", not "the kernel will not create partitions for pmem
devices".

Yeah, that would be inconsistent and weird.  I'd say deprecate the
kernel automounting partitions, but I guess it already does that, and
removing it would break /something/.  I guess you could put
"/dev/pmemXpY" on the deprecation schedule.

--D


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 19:02                                 ` Darrick J. Wong
@ 2020-01-07 19:46                                   ` Dan Williams
  2020-01-07 23:38                                     ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-07 19:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Vivek Goyal, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 7, 2020 at 11:03 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
[..]
> > That can already happen today. If you do not properly align the
> > partition then dax operations will be disabled.
>
> Er... is this conversation getting confused?  I was talking about
> kpartx's /dev/mapper/pmem0p1 being a straight replacement for the kernel
> creating /dev/pmem0p1.  I think Vivek was complaining about the
> inconsistent behavior between the two, even if the partition is aligned
> properly.
>
> I'm not sure how alignment leaked in here?

Oh, whoops, I was jumping to the mismatch between host device and
partition and whether we had precedent to fail to support dax on the
partition when the base block device does support it.

But yes, the mismatch between kpartx and native partitions is weird.
That said kpartx is there to add partition support where the kernel
for whatever reason fails to, or chooses not to, and dax is looking
like such a place.

> > This proposal just
> > extends that existing failure domain to make all partitions fail to
> > support dax.
>
> Oh, wait.  You're proposing that "partitions of pmem devices don't
> support DAX", not "the kernel will not create partitions for pmem
> devices".
>
> Yeah, that would be inconsistent and weird.

More weird than the current constraints?

> I'd say deprecate the
> kernel automounting partitions, but I guess it already does that, and

Ok, now I don't know why automounting is leaking into this discussion?

> removing it would break /something/.

Yes, the breakage risk is anyone that was using ext4 mount failure as
a dax capability detector.

> I guess you could put
> "/dev/pmemXpY" on the deprecation schedule.

...but why deprecate /dev/pmemXpY partitions altogether? If someone
doesn't care about dax then they can do all the legacy block things.
If they do care about dax then work with whole device namespaces.

The proposal is to detect dax on partitions and warn people to move to
kpartx. Let the core fs/dax implementation continue to shed block
dependencies.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 19:46                                   ` Dan Williams
@ 2020-01-07 23:38                                     ` Dan Williams
  0 siblings, 0 replies; 77+ messages in thread
From: Dan Williams @ 2020-01-07 23:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Vivek Goyal, Christoph Hellwig, Dave Chinner, Miklos Szeredi,
	linux-nvdimm, Linux Kernel Mailing List, Dr. David Alan Gilbert,
	virtio-fs, Stefan Hajnoczi, linux-fsdevel

On Tue, Jan 7, 2020 at 11:46 AM Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> > I'd say deprecate the
> > kernel automounting partitions, but I guess it already does that, and
>
> Ok, now I don't know why automounting is leaking into this discussion?
>
> > removing it would break /something/.
>
> Yes, the breakage risk is anyone that was using ext4 mount failure as
> a dax capability detector.
>
> > I guess you could put
> > "/dev/pmemXpY" on the deprecation schedule.
>
> ...but why deprecate /dev/pmemXpY partitions altogether? If someone
> doesn't care about dax then they can do all the legacy block things.
> If they do care about dax then work with whole device namespaces.

Circling back on this point now that I understand what you meant by
automount. It would need to be a full deprecation of /dev/pmemXpY
devices if kpartx dax support is going to fully take over for people
that want to use disk partition tables instead of EFI Namespace Labels
to carve up pmem.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-07 18:49                               ` Dan Williams
  2020-01-07 19:02                                 ` Darrick J. Wong
@ 2020-01-09 11:24                                 ` Jan Kara
  2020-01-09 20:03                                   ` Dan Williams
  1 sibling, 1 reply; 77+ messages in thread
From: Jan Kara @ 2020-01-09 11:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vivek Goyal, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel

On Tue 07-01-20 10:49:55, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > dax code refers back to block device to figure out partition offset in
> > dax device. If we create a dax object corresponding to "struct block_device"
> > and store sector offset in that, then we could pass that object to dax
> > code and not worry about referring back to bdev. I have written some
> > proof of concept code and called that object "dax_handle". I can post
> > that code if there is interest.
> 
> I don't think it's worth it in the end especially considering
> filesystems are looking to operate on /dev/dax devices directly and
> remove block entanglements entirely.
> 
> > IMHO, it feels useful to be able to partition and use a dax capable
> > block device in same way as non-dax block device. It will be really
> > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > will work.
> 
> That can already happen today. If you do not properly align the
> partition then dax operations will be disabled. This proposal just
> extends that existing failure domain to make all partitions fail to
> support dax.

Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
decides to create partitions on it for whatever (possibly misguided)
reason and then ponders why the hell DAX is not working? And PAGE_SIZE
partition alignment is so obvious and widespread that I don't count it as a
realistic error case sysadmins would be pondering about currently.
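
For illustration, the alignment rule being referred to can be modelled in
userspace roughly as follows. This is only a sketch: it assumes 512-byte
sectors and 4 KiB pages, and model_dax_pgoff() is a hypothetical name
loosely patterned on what bdev_dax_pgoff() is described as doing in this
thread (adding the partition start to the sector and checking the result),
not the kernel function itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SIZE 512u   /* assumption: 512-byte sectors */
#define MODEL_PAGE_SIZE   4096u  /* assumption: 4 KiB pages */

/*
 * Hypothetical userspace model: translate a sector inside a partition
 * into a page offset inside the whole dax device. Returns 0 on success
 * and -EINVAL if the byte offset or the size is not page aligned, which
 * is the case where dax would be refused for the partition.
 */
static int model_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
                           size_t size, uint64_t *pgoff)
{
    uint64_t phys_off = (part_start_sect + sector) * MODEL_SECTOR_SIZE;

    assert(pgoff != NULL);
    *pgoff = phys_off / MODEL_PAGE_SIZE;
    if (phys_off % MODEL_PAGE_SIZE || size % MODEL_PAGE_SIZE)
        return -EINVAL;
    return 0;
}
```

With these assumptions a partition starting at sector 2048 (1 MiB) passes
the check, while a legacy partition starting at sector 63 does not.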

So I'd find two options reasonably consistent:
1) Keep status quo where partitions are created and support DAX.
2) Stop partition creation altogether, if anyone wants to split pmem
device further, he can use dm-linear for that (i.e., kpartx).

But I'm not sure if the ship hasn't already sailed for option 2) to be
feasible without angry users and Linus reverting the change.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-09 11:24                                 ` Jan Kara
@ 2020-01-09 20:03                                   ` Dan Williams
  2020-01-10 12:36                                     ` Christoph Hellwig
  2020-01-14 20:31                                     ` Vivek Goyal
  0 siblings, 2 replies; 77+ messages in thread
From: Dan Williams @ 2020-01-09 20:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel

On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > dax code refers back to block device to figure out partition offset in
> > > dax device. If we create a dax object corresponding to "struct block_device"
> > > and store sector offset in that, then we could pass that object to dax
> > > code and not worry about referring back to bdev. I have written some
> > > proof of concept code and called that object "dax_handle". I can post
> > > that code if there is interest.
> >
> > I don't think it's worth it in the end especially considering
> > filesystems are looking to operate on /dev/dax devices directly and
> > remove block entanglements entirely.
> >
> > > IMHO, it feels useful to be able to partition and use a dax capable
> > > block device in same way as non-dax block device. It will be really
> > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > will work.
> >
> > That can already happen today. If you do not properly align the
> > partition then dax operations will be disabled. This proposal just
> > extends that existing failure domain to make all partitions fail to
> > support dax.
>
> Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> decides to create partitions on it for whatever (possibly misguided)
> reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> partition alignment is so obvious and widespread that I don't count it as a
> realistic error case sysadmins would be pondering about currently.
>
> So I'd find two options reasonably consistent:
> 1) Keep status quo where partitions are created and support DAX.
> 2) Stop partition creation altogether, if anyone wants to split pmem
> device further, he can use dm-linear for that (i.e., kpartx).
>
> But I'm not sure if the ship hasn't already sailed for option 2) to be
> feasible without angry users and Linus reverting the change.

Christoph? I feel myself leaning more and more to the "keep pmem
partitions" camp.

I don't see "drop partition support" effort ending well given the long
standing "ext4 fails to mount when dax is not available" precedent.

I think the next least bad option is to have a dax_get_by_host()
variant that passes an offset and length pair rather than requiring a
later bdev_dax_pgoff() to recall the offset. This also prevents
needing to add another dax-device object representation.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-09 20:03                                   ` Dan Williams
@ 2020-01-10 12:36                                     ` Christoph Hellwig
  2020-01-14 20:31                                     ` Vivek Goyal
  1 sibling, 0 replies; 77+ messages in thread
From: Christoph Hellwig @ 2020-01-10 12:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel

On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > So I'd find two options reasonably consistent:
> > 1) Keep status quo where partitions are created and support DAX.
> > 2) Stop partition creation altogether, if anyone wants to split pmem
> > device further, he can use dm-linear for that (i.e., kpartx).
> >
> > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > feasible without angry users and Linus reverting the change.
> 
> Christoph? I feel myself leaning more and more to the "keep pmem
> partitions" camp.
> 
> I don't see "drop partition support" effort ending well given the long
> standing "ext4 fails to mount when dax is not available" precedent.

Do we have any evidence of existing setups with DAX and partitions?
Can we just throw in a patch to reject that case for now, before actually
removing the code, and see if anyone screams?  And fix ext4 up while
we are at it.

> I think the next least bad option is to have a dax_get_by_host()
> variant that passes an offset and length pair rather than requiring a
> later bdev_dax_pgoff() to recall the offset. This also prevents
> needing to add another dax-device object representation.

IFF we have to keep partition support, yes.  But keeping it just seems
like a really bad idea.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-09 20:03                                   ` Dan Williams
  2020-01-10 12:36                                     ` Christoph Hellwig
@ 2020-01-14 20:31                                     ` Vivek Goyal
  2020-01-14 20:39                                       ` Dan Williams
  1 sibling, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-14 20:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel

On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > dax code refers back to block device to figure out partition offset in
> > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > and store sector offset in that, then we could pass that object to dax
> > > > code and not worry about referring back to bdev. I have written some
> > > > proof of concept code and called that object "dax_handle". I can post
> > > > that code if there is interest.
> > >
> > > I don't think it's worth it in the end especially considering
> > > filesystems are looking to operate on /dev/dax devices directly and
> > > remove block entanglements entirely.
> > >
> > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > block device in same way as non-dax block device. It will be really
> > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > will work.
> > >
> > > That can already happen today. If you do not properly align the
> > > partition then dax operations will be disabled. This proposal just
> > > extends that existing failure domain to make all partitions fail to
> > > support dax.
> >
> > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > decides to create partitions on it for whatever (possibly misguided)
> > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > partition alignment is so obvious and widespread that I don't count it as a
> > realistic error case sysadmins would be pondering about currently.
> >
> > So I'd find two options reasonably consistent:
> > 1) Keep status quo where partitions are created and support DAX.
> > > 2) Stop partition creation altogether, if anyone wants to split pmem
> > device further, he can use dm-linear for that (i.e., kpartx).
> >
> > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > feasible without angry users and Linus reverting the change.
> 
> Christoph? I feel myself leaning more and more to the "keep pmem
> partitions" camp.
> 
> I don't see "drop partition support" effort ending well given the long
> standing "ext4 fails to mount when dax is not available" precedent.
> 
> I think the next least bad option is to have a dax_get_by_host()
> variant that passes an offset and length pair rather than requiring a
> later bdev_dax_pgoff() to recall the offset. This also prevents
> needing to add another dax-device object representation.

I am wondering what the conclusion on this is. I want this to make
progress in some direction so that I can make progress on virtiofs DAX
support.

Thanks
Vivek



* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-14 20:31                                     ` Vivek Goyal
@ 2020-01-14 20:39                                       ` Dan Williams
  2020-01-14 21:28                                         ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-14 20:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel

On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > dax code refers back to block device to figure out partition offset in
> > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > and store sector offset in that, then we could pass that object to dax
> > > > > code and not worry about referring back to bdev. I have written some
> > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > that code if there is interest.
> > > >
> > > > I don't think it's worth it in the end especially considering
> > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > remove block entanglements entirely.
> > > >
> > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > block device in same way as non-dax block device. It will be really
> > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > will work.
> > > >
> > > > That can already happen today. If you do not properly align the
> > > > partition then dax operations will be disabled. This proposal just
> > > > extends that existing failure domain to make all partitions fail to
> > > > support dax.
> > >
> > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > decides to create partitions on it for whatever (possibly misguided)
> > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > partition alignment is so obvious and widespread that I don't count it as a
> > > realistic error case sysadmins would be pondering about currently.
> > >
> > > So I'd find two options reasonably consistent:
> > > 1) Keep status quo where partitions are created and support DAX.
> > > > 2) Stop partition creation altogether, if anyone wants to split pmem
> > > device further, he can use dm-linear for that (i.e., kpartx).
> > >
> > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > feasible without angry users and Linus reverting the change.
> >
> > Christoph? I feel myself leaning more and more to the "keep pmem
> > partitions" camp.
> >
> > I don't see "drop partition support" effort ending well given the long
> > standing "ext4 fails to mount when dax is not available" precedent.
> >
> > I think the next least bad option is to have a dax_get_by_host()
> > variant that passes an offset and length pair rather than requiring a
> > later bdev_dax_pgoff() to recall the offset. This also prevents
> > needing to add another dax-device object representation.
>
> I am wondering what the conclusion on this is. I want this to make
> progress in some direction so that I can make progress on virtiofs DAX
> support.

I think we should at least try to delete the partition support and see
if anyone screams. Have a module option to revert the behavior so
people are not stuck waiting for the revert to land, but if it stays
quiet then we're in a better place with that support pushed out of the
dax core.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-14 20:39                                       ` Dan Williams
@ 2020-01-14 21:28                                         ` Vivek Goyal
  2020-01-14 22:23                                           ` Dan Williams
  2020-01-15  9:03                                           ` Jan Kara
  0 siblings, 2 replies; 77+ messages in thread
From: Vivek Goyal @ 2020-01-14 21:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel, Jeff Moyer

On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > > dax code refers back to block device to figure out partition offset in
> > > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > > and store sector offset in that, then we could pass that object to dax
> > > > > > code and not worry about referring back to bdev. I have written some
> > > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > > that code if there is interest.
> > > > >
> > > > > I don't think it's worth it in the end especially considering
> > > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > > remove block entanglements entirely.
> > > > >
> > > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > > block device in same way as non-dax block device. It will be really
> > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > > will work.
> > > > >
> > > > > That can already happen today. If you do not properly align the
> > > > > partition then dax operations will be disabled. This proposal just
> > > > > extends that existing failure domain to make all partitions fail to
> > > > > support dax.
> > > >
> > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > > decides to create partitions on it for whatever (possibly misguided)
> > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > > partition alignment is so obvious and widespread that I don't count it as a
> > > > realistic error case sysadmins would be pondering about currently.
> > > >
> > > > So I'd find two options reasonably consistent:
> > > > 1) Keep status quo where partitions are created and support DAX.
> > > > > 2) Stop partition creation altogether, if anyone wants to split pmem
> > > > device further, he can use dm-linear for that (i.e., kpartx).
> > > >
> > > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > > feasible without angry users and Linus reverting the change.
> > >
> > > Christoph? I feel myself leaning more and more to the "keep pmem
> > > partitions" camp.
> > >
> > > I don't see "drop partition support" effort ending well given the long
> > > standing "ext4 fails to mount when dax is not available" precedent.
> > >
> > > I think the next least bad option is to have a dax_get_by_host()
> > > variant that passes an offset and length pair rather than requiring a
> > > later bdev_dax_pgoff() to recall the offset. This also prevents
> > > needing to add another dax-device object representation.
> >
> > I am wondering what the conclusion on this is. I want this to make
> > progress in some direction so that I can make progress on virtiofs DAX
> > support.
> 
> I think we should at least try to delete the partition support and see
> if anyone screams. Have a module option to revert the behavior so
> people are not stuck waiting for the revert to land, but if it stays
> quiet then we're in a better place with that support pushed out of the
> dax core.

Hi Dan,

So basically, keep the partition support code but disable it by default,
and let it be re-enabled by some knob, say a kernel command line option
or a module option.

At what point will we remove that code completely? I mean, what
if people scream after two kernel releases, after we have removed the
code?

Also, from a distribution's perspective, we might not hear from our
customers for a very long time (until we backport that code into
existing releases or ship this new code in the next major release). From
that viewpoint I would not like to break existing user-visible behavior.

How bad is it to keep partition support around? To me it feels reasonably
simple: we just have to store the offset into the dax device in another
dax object and pass that object around (instead of dax_device). If that's
the case, I am not sure why we should even venture in a direction where
some user's setup might be broken.

Also, from an application's perspective, /dev/pmem is a block device, so it
should behave like a block device (including kernel partition table support).
From that view, dax looks like just an additional feature of that device
which can be enabled by passing the option "-o dax".

IOW, can we reconsider the idea of not supporting kernel partition tables
for dax-capable block devices? I can only see downsides to removing kernel
partition table support, and the only upside seems to be a little cleanup
of the dax core code.

Thanks
Vivek



* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-14 21:28                                         ` Vivek Goyal
@ 2020-01-14 22:23                                           ` Dan Williams
  2020-01-15 19:56                                             ` Vivek Goyal
  2020-01-15  9:03                                           ` Jan Kara
  1 sibling, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-14 22:23 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel, Jeff Moyer

On Tue, Jan 14, 2020 at 1:28 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> > On Tue, Jan 14, 2020 at 12:31 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Jan 09, 2020 at 12:03:01PM -0800, Dan Williams wrote:
> > > > On Thu, Jan 9, 2020 at 3:27 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 07-01-20 10:49:55, Dan Williams wrote:
> > > > > > On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > > > W.r.t partitioning, bdev_dax_pgoff() seems to be the pain point where
> > > > > > > dax code refers back to block device to figure out partition offset in
> > > > > > > dax device. If we create a dax object corresponding to "struct block_device"
> > > > > > > and store sector offset in that, then we could pass that object to dax
> > > > > > > code and not worry about referring back to bdev. I have written some
> > > > > > > proof of concept code and called that object "dax_handle". I can post
> > > > > > > that code if there is interest.
> > > > > >
> > > > > > I don't think it's worth it in the end especially considering
> > > > > > filesystems are looking to operate on /dev/dax devices directly and
> > > > > > remove block entanglements entirely.
> > > > > >
> > > > > > > IMHO, it feels useful to be able to partition and use a dax capable
> > > > > > > block device in same way as non-dax block device. It will be really
> > > > > > > odd to think that if filesystem is on /dev/pmem0p1, then dax can't
> > > > > > > be enabled but if filesystem is on /dev/mapper/pmem0p1, then dax
> > > > > > > will work.
> > > > > >
> > > > > > That can already happen today. If you do not properly align the
> > > > > > partition then dax operations will be disabled. This proposal just
> > > > > > extends that existing failure domain to make all partitions fail to
> > > > > > support dax.
> > > > >
> > > > > Well, I have some sympathy with the sysadmin that has /dev/pmem0 device,
> > > > > decides to create partitions on it for whatever (possibly misguided)
> > > > > reason and then ponders why the hell DAX is not working? And PAGE_SIZE
> > > > > partition alignment is so obvious and widespread that I don't count it as a
> > > > > realistic error case sysadmins would be pondering about currently.
> > > > >
> > > > > So I'd find two options reasonably consistent:
> > > > > 1) Keep status quo where partitions are created and support DAX.
> > > > > 2) Stop partition creation altogether, if anyone wants to split pmem
> > > > > device further, he can use dm-linear for that (i.e., kpartx).
> > > > >
> > > > > But I'm not sure if the ship hasn't already sailed for option 2) to be
> > > > > feasible without angry users and Linus reverting the change.
> > > >
> > > > Christoph? I feel myself leaning more and more to the "keep pmem
> > > > partitions" camp.
> > > >
> > > > I don't see "drop partition support" effort ending well given the long
> > > > standing "ext4 fails to mount when dax is not available" precedent.
> > > >
> > > > I think the next least bad option is to have a dax_get_by_host()
> > > > variant that passes an offset and length pair rather than requiring a
> > > > later bdev_dax_pgoff() to recall the offset. This also prevents
> > > > needing to add another dax-device object representation.
> > >
> > > I am wondering what's the conclusion on this. I want this to make
> > > progress in some direction so that I can make progress on virtiofs DAX
> > > support.
> >
> > I think we should at least try to delete the partition support and see
> > if anyone screams. Have a module option to revert the behavior so
> > people are not stuck waiting for the revert to land, but if it stays
> > quiet then we're in a better place with that support pushed out of the
> > dax core.
>
> Hi Dan,
>
> So basically keep partition support code just that disable it by default
> and it is enabled by some knob say kernel command line option/module
> option.

Yes.

> At what point of time will we remove that code completely. I mean what
> if people scream after two kernel releases, after we have removed the
> code.

I'd follow the typical timelines of Documentation/ABI/obsolete which
is a year or more.

>
> Also, from distribution's perspective, we might not hear from our
> customers for a very long time (till we backport that code in to
> existing releases or release this new code in next major release). From
> that view point I will not like to break existing user visible behavior.
>
> How bad it is to keep partition support around. To me it feels reasonably
> simple where we just have to store offset into dax device into another
> dax object:

If we end up keeping partition support, we're not adding another object.

> and pass that object around (instead of dax_device). If that's
> the case, I am not sure why to even venture into a direction where some
> user's setup might be broken.

It was a mistake to support them. If that mistake can be undone
without breaking existing deployments the code base is better off
without the concept.

> Also from an application perspective, /dev/pmem is a block device, so it
> should behave like a block device, (including kernel partition table support).
> From that view, dax looks like just an additional feature of that device
> which can be enabled by passing option "-o dax".

dax via block devices was a crutch that we leaned on too heavily, and
the implementation has slowly been moving away from it ever since.

> IOW, can we reconsider the idea of not supporting kernel partition tables
> for dax capable block devices. I can only see downsides of removing kernel
> partition table support and only upside seems to be little cleanup of dax
> core code.

Can you help find end users that depend on it? Even the Red Hat
installation guide example shows mounting on pmem0 directly. [1]

My primary concern is people that might be booting from pmem as boot
support requires an EFI partition table, and initramfs images would
need to be respun to move to kpartx.

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/storage_administration_guide/index#Configuring-Persistent-Memory-for-File-System-Direct-Access-DAX


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-14 21:28                                         ` Vivek Goyal
  2020-01-14 22:23                                           ` Dan Williams
@ 2020-01-15  9:03                                           ` Jan Kara
  1 sibling, 0 replies; 77+ messages in thread
From: Jan Kara @ 2020-01-15  9:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dan Williams, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel, Jeff Moyer

On Tue 14-01-20 16:28:05, Vivek Goyal wrote:
> On Tue, Jan 14, 2020 at 12:39:00PM -0800, Dan Williams wrote:
> > I think we should at least try to delete the partition support and see
> > if anyone screams. Have a module option to revert the behavior so
> > people are not stuck waiting for the revert to land, but if it stays
> > quiet then we're in a better place with that support pushed out of the
> > dax core.
> 
> Hi Dan,
> 
> So basically keep partition support code just that disable it by default
> and it is enabled by some knob say kernel command line option/module
> option.
> 
> At what point of time will we remove that code completely. I mean what
> if people scream after two kernel releases, after we have removed the
> code.
> 
> Also, from distribution's perspective, we might not hear from our
> customers for a very long time (till we backport that code in to
> existing releases or release this new code in next major release). From
> that view point I will not like to break existing user visible behavior.
> 
> How bad it is to keep partition support around. To me it feels reasonably
> simple where we just have to store offset into dax device into another
> dax object and pass that object around (instead of dax_device). If that's
> the case, I am not sure why to even venture into a direction where some
> user's setup might be broken.
> 
> Also from an application perspective, /dev/pmem is a block device, so it
> should behave like a block device, (including kernel partition table support).
> From that view, dax looks like just an additional feature of that device
> which can be enabled by passing option "-o dax".

Well, not all block devices are partitionable. For example, cdroms are
standard block devices but partitioning does not run for them. Similarly,
device mapper devices are block devices but are not partitioned. So there
is some precedent for not doing partitioning for some types of block
devices.

For the rest, I agree that kernels where pmem devices are partitionable
have shipped in enterprise distros and are going to be supported (and
used) for 5-10 years before users decide to move on to something newer, at
which point we'll only find out whether someone used the feature or not.
So deprecation is going to be somewhat interesting. On the other hand, a
clever udev rule that detects a partition table on a pmem device and uses
kpartx to partition these devices (like what happens e.g. for dm-multipath
devices) could possibly be used as a replacement for kernel support, so
there's a way out of this...

So I don't care too deeply about what the decision is going to be.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-14 22:23                                           ` Dan Williams
@ 2020-01-15 19:56                                             ` Vivek Goyal
  2020-01-15 20:17                                               ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-15 19:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel, Jeff Moyer

On Tue, Jan 14, 2020 at 02:23:04PM -0800, Dan Williams wrote:
> On Tue, Jan 14, 2020 at 1:28 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
[..]
> > Hi Dan,
> >
> > So basically keep partition support code just that disable it by default
> > and it is enabled by some knob say kernel command line option/module
> > option.
> 
> Yes.
> 
> > At what point of time will we remove that code completely. I mean what
> > if people scream after two kernel releases, after we have removed the
> > code.
> 
> I'd follow the typical timelines of Documentation/ABI/obsolete which
> is a year or more.
> 
> >
> > Also, from distribution's perspective, we might not hear from our
> > customers for a very long time (till we backport that code in to
> > existing releases or release this new code in next major release). From
> > that view point I will not like to break existing user visible behavior.
> >
> > How bad it is to keep partition support around. To me it feels reasonably
> > simple where we just have to store offset into dax device into another
> > dax object:
> 
> If we end up keeping partition support, we're not adding another object.
> 
> > and pass that object around (instead of dax_device). If that's
> > the case, I am not sure why to even venture into a direction where some
> > user's setup might be broken.
> 
> It was a mistake to support them. If that mistake can be undone
> without breaking existing deployments the code base is better off
> without the concept.
> 
> > Also from an application perspective, /dev/pmem is a block device, so it
> > should behave like a block device, (including kernel partition table support).
> > From that view, dax looks like just an additional feature of that device
> > which can be enabled by passing option "-o dax".
> 
> dax via block devices was a crutch that we leaned on too heavily, and
> the implementation has slowly been moving away from it ever since.
> 
> > IOW, can we reconsider the idea of not supporting kernel partition tables
> > for dax capable block devices. I can only see downsides of removing kernel
> > partition table support and only upside seems to be little cleanup of dax
> > core code.
> 
> Can you help find end users that depend on it?

I can't think of a real user at this point in time. I am just concerned
that once the change goes in, somebody will be affected at a later point
and come out complaining, and this change will be seen as breaking user
space and hence a regression.

> Even the Red Hat
> installation guide example shows mounting on pmem0 directly. [1]

Below that example it also says:

"When creating partitions on a pmem device to be used for direct access,
partitions must be aligned on page boundaries. On the Intel 64 and AMD64
architecture, at least 4KiB alignment for the start and end of the
partition, but 2MiB is the preferred alignment. By default, the parted
tool aligns partitions on 1MiB boundaries. For the first partition,
specify 2MiB as the start of the partition. If the size of the partition
is a multiple of 2MiB, all other partitions are also aligned."

So the documentation clearly says that dax will work with partitions as
well, and some user might decide to do just that.

Thanks
Vivek



* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-15 19:56                                             ` Vivek Goyal
@ 2020-01-15 20:17                                               ` Dan Williams
  2020-01-15 21:08                                                 ` Jeff Moyer
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-15 20:17 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Miklos Szeredi, linux-nvdimm, Linux Kernel Mailing List,
	Dr. David Alan Gilbert, virtio-fs, Stefan Hajnoczi,
	linux-fsdevel, Jeff Moyer

On Wed, Jan 15, 2020 at 11:56 AM Vivek Goyal <vgoyal@redhat.com> wrote:
[..]
> > Even the Red Hat
> > installation guide example shows mounting on pmem0 directly. [1]
>
> Below that example it also says.
>
> "When creating partitions on a pmem device to be used for direct access,
> partitions must be aligned on page boundaries. On the Intel 64 and AMD64
> architecture, at least 4KiB alignment for the start and end of the
> partition, but 2MiB is the preferred alignment. By default, the parted
> tool aligns partitions on 1MiB boundaries. For the first partition,
> specify 2MiB as the start of the partition. If the size of the partition
> is a multiple of 2MiB, all other partitions are also aligned."
>
> So documentation is clearly saying dax will work with partitions as well.
> And some user might decide to just do that.

Yes, of course, but my point is that it was ambiguous.

I'm going to take a look at how hard it would be to develop a kpartx
fallback in udev. If that can live across the driver transition then
maybe this can be a non-event for end users that already have that
udev update deployed.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-15 20:17                                               ` Dan Williams
@ 2020-01-15 21:08                                                 ` Jeff Moyer
  2020-01-16 18:09                                                   ` Dan Williams
  0 siblings, 1 reply; 77+ messages in thread
From: Jeff Moyer @ 2020-01-15 21:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vivek Goyal, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

Hi, Dan,

Dan Williams <dan.j.williams@intel.com> writes:

> I'm going to take a look at how hard it would be to develop a kpartx
> fallback in udev. If that can live across the driver transition then
> maybe this can be a non-event for end users that already have that
> udev update deployed.

I just wanted to remind you that label-less dimms still exist, and are
still being shipped.  For those devices, the only way to subdivide the
storage is via partitioning.

-Jeff



* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-15 21:08                                                 ` Jeff Moyer
@ 2020-01-16 18:09                                                   ` Dan Williams
  2020-01-16 18:39                                                     ` Vivek Goyal
  2020-02-11 17:33                                                     ` Vivek Goyal
  0 siblings, 2 replies; 77+ messages in thread
From: Dan Williams @ 2020-01-16 18:09 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Vivek Goyal, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Hi, Dan,
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > I'm going to take a look at how hard it would be to develop a kpartx
> > fallback in udev. If that can live across the driver transition then
> > maybe this can be a non-event for end users that already have that
> > udev update deployed.
>
> I just wanted to remind you that label-less dimms still exist, and are
> still being shipped.  For those devices, the only way to subdivide the
> storage is via partitioning.

True, but if kpartx + udev can make this transparent then I don't
think users lose any functionality. They just gain a device-mapper
dependency.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-16 18:09                                                   ` Dan Williams
@ 2020-01-16 18:39                                                     ` Vivek Goyal
  2020-01-16 19:09                                                       ` Dan Williams
  2020-02-11 17:33                                                     ` Vivek Goyal
  1 sibling, 1 reply; 77+ messages in thread
From: Vivek Goyal @ 2020-01-16 18:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >
> > Hi, Dan,
> >
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> > > I'm going to take a look at how hard it would be to develop a kpartx
> > > fallback in udev. If that can live across the driver transition then
> > > maybe this can be a non-event for end users that already have that
> > > udev update deployed.
> >
> > I just wanted to remind you that label-less dimms still exist, and are
> > still being shipped.  For those devices, the only way to subdivide the
> > storage is via partitioning.
> 
> True, but if kpartx + udev can make this transparent then I don't
> think users lose any functionality. They just gain a device-mapper
> dependency.

So udev rules will trigger when a /dev/pmemX device shows up and run
kpartx, which in turn will create dm-linear devices, and device nodes
will show up as /dev/mapper/pmemXpY.

IOW, /dev/pmemXpY device nodes will be gone. So if any scripts or
systemd unit files are dependent on /dev/pmemXpY, these will still be
broken out of the box and will have to be modified to use device nodes
in the /dev/mapper/ directory instead. Do I understand it right, or did I
miss the idea completely?

Vivek



* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-16 18:39                                                     ` Vivek Goyal
@ 2020-01-16 19:09                                                       ` Dan Williams
  2020-01-16 19:23                                                         ` Vivek Goyal
  0 siblings, 1 reply; 77+ messages in thread
From: Dan Williams @ 2020-01-16 19:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jeff Moyer, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Thu, Jan 16, 2020 at 10:39 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote:
> > On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> > >
> > > Hi, Dan,
> > >
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > >
> > > > I'm going to take a look at how hard it would be to develop a kpartx
> > > > fallback in udev. If that can live across the driver transition then
> > > > maybe this can be a non-event for end users that already have that
> > > > udev update deployed.
> > >
> > > I just wanted to remind you that label-less dimms still exist, and are
> > > still being shipped.  For those devices, the only way to subdivide the
> > > storage is via partitioning.
> >
> > True, but if kpartx + udev can make this transparent then I don't
> > think users lose any functionality. They just gain a device-mapper
> > dependency.
>
> So udev rules will trigger when a /dev/pmemX device shows up and run
> kpartx which in turn will create dm-linear devices and device nodes
> will show up in /dev/mapper/pmemXpY.
>
> IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or
> systemd unit files are dependent on /dev/pmemXpY, these will still be
> broken out of the box and will have to be modified to use device nodes
> in /dev/mapper/ directory instead. Do I understand it right, Or I missed
> the idea completely.

No, I'd write the udev rule to create links from /dev/pmemXpY to the
/dev/mapper device, and that rule would be gated by a new pmem device
attribute to trigger when kpartx needs to run vs the kernel native
partitions.


* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-16 19:09                                                       ` Dan Williams
@ 2020-01-16 19:23                                                         ` Vivek Goyal
  0 siblings, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2020-01-16 19:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Thu, Jan 16, 2020 at 11:09:00AM -0800, Dan Williams wrote:

[..]
> > > True, but if kpartx + udev can make this transparent then I don't
> > > think users lose any functionality. They just gain a device-mapper
> > > dependency.
> >
> > So udev rules will trigger when a /dev/pmemX device shows up and run
> > kpartx which in turn will create dm-linear devices and device nodes
> > will show up in /dev/mapper/pmemXpY.
> >
> > IOW, /dev/pmemXpY device nodes will be gone. So if any of the scripts or
> > systemd unit files are dependent on /dev/pmemXpY, these will still be
> > broken out of the box and will have to be modified to use device nodes
> > in /dev/mapper/ directory instead. Do I understand it right, Or I missed
> > the idea completely.
> 
> No, I'd write the udev rule to create links from /dev/pmemXpY to the
> /dev/mapper device, and that rule would be gated by a new pmem device
> attribute to trigger when kpartx needs to run vs the kernel native
> partitions.

Got it. This sounds much better.

Vivek



* Re: [PATCH 01/19] dax: remove block device dependencies
  2020-01-16 18:09                                                   ` Dan Williams
  2020-01-16 18:39                                                     ` Vivek Goyal
@ 2020-02-11 17:33                                                     ` Vivek Goyal
  1 sibling, 0 replies; 77+ messages in thread
From: Vivek Goyal @ 2020-02-11 17:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Moyer, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	Dave Chinner, Miklos Szeredi, linux-nvdimm,
	Linux Kernel Mailing List, Dr. David Alan Gilbert, virtio-fs,
	Stefan Hajnoczi, linux-fsdevel

On Thu, Jan 16, 2020 at 10:09:46AM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 1:08 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >
> > Hi, Dan,
> >
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> > > I'm going to take a look at how hard it would be to develop a kpartx
> > > fallback in udev. If that can live across the driver transition then
> > > maybe this can be a non-event for end users that already have that
> > > udev update deployed.
> >
> > I just wanted to remind you that label-less dimms still exist, and are
> > still being shipped.  For those devices, the only way to subdivide the
> > storage is via partitioning.
> 
> True, but if kpartx + udev can make this transparent then I don't
> think users lose any functionality. They just gain a device-mapper
> dependency.

Hi Dan,

Are you planning to look into making this work?

We can easily disable partition scanning by setting the gendisk
GENHD_FL_NO_PART_SCAN flag. But what about the partition addition path,
ioctl(BLKPG_ADD_PARTITION)? That does not seem to check whether the
block device supports in-kernel partitions or not.

So kernel partition objects (hence /dev/pmemXpY) are created anyway, and
this will conflict with all the newly planned udev rules.

If you block ioctl(BLKPG_ADD_PARTITION), then user space tools like
parted and fdisk start breaking when trying to create a partition
on /dev/pmem0. IIUC, we have to allow partition table creation on
/dev/pmem0 so that kpartx can later parse it and create dm-linear
partitions.

Thanks
Vivek



end of thread, other threads:[~2020-02-11 17:33 UTC | newest]

Thread overview: 77+ messages
2019-08-21 17:57 [PATCH v3 00/19][RFC] virtio-fs: Enable DAX support Vivek Goyal
2019-08-21 17:57 ` [PATCH 01/19] dax: remove block device dependencies Vivek Goyal
2019-08-26 11:51   ` Christoph Hellwig
2019-08-27 16:38     ` Vivek Goyal
2019-08-28  6:58       ` Christoph Hellwig
2019-08-28 17:58         ` Vivek Goyal
2019-08-28 22:53           ` Dave Chinner
2019-08-29  0:04             ` Dan Williams
2019-08-29  9:32               ` Christoph Hellwig
2019-12-16 18:10               ` Vivek Goyal
2020-01-07 12:51                 ` Christoph Hellwig
2020-01-07 14:22                   ` Dan Williams
2020-01-07 17:07                     ` Darrick J. Wong
2020-01-07 17:29                       ` Dan Williams
2020-01-07 18:01                         ` Vivek Goyal
2020-01-07 18:07                           ` Dan Williams
2020-01-07 18:33                             ` Vivek Goyal
2020-01-07 18:49                               ` Dan Williams
2020-01-07 19:02                                 ` Darrick J. Wong
2020-01-07 19:46                                   ` Dan Williams
2020-01-07 23:38                                     ` Dan Williams
2020-01-09 11:24                                 ` Jan Kara
2020-01-09 20:03                                   ` Dan Williams
2020-01-10 12:36                                     ` Christoph Hellwig
2020-01-14 20:31                                     ` Vivek Goyal
2020-01-14 20:39                                       ` Dan Williams
2020-01-14 21:28                                         ` Vivek Goyal
2020-01-14 22:23                                           ` Dan Williams
2020-01-15 19:56                                             ` Vivek Goyal
2020-01-15 20:17                                               ` Dan Williams
2020-01-15 21:08                                                 ` Jeff Moyer
2020-01-16 18:09                                                   ` Dan Williams
2020-01-16 18:39                                                     ` Vivek Goyal
2020-01-16 19:09                                                       ` Dan Williams
2020-01-16 19:23                                                         ` Vivek Goyal
2020-02-11 17:33                                                     ` Vivek Goyal
2020-01-15  9:03                                           ` Jan Kara
2019-08-21 17:57 ` [PATCH 02/19] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
2019-08-26 11:53   ` Christoph Hellwig
2019-08-26 20:33     ` Vivek Goyal
2019-08-26 20:58       ` Vivek Goyal
2019-08-26 21:33         ` Dan Williams
2019-08-28  6:58         ` Christoph Hellwig
2020-01-03 14:12         ` Vivek Goyal
2020-01-03 18:12           ` Dan Williams
2020-01-03 18:18             ` Dan Williams
2020-01-03 18:33               ` Vivek Goyal
2020-01-03 19:30                 ` Dan Williams
2020-01-03 18:43               ` Vivek Goyal
2019-08-27 13:45       ` Jan Kara
2019-08-21 17:57 ` [PATCH 03/19] virtio: Add get_shm_region method Vivek Goyal
2019-08-21 17:57 ` [PATCH 04/19] virtio: Implement get_shm_region for PCI transport Vivek Goyal
2019-08-26  1:43   ` [Virtio-fs] " piaojun
2019-08-26 13:06     ` Vivek Goyal
2019-08-27  9:41       ` piaojun
2019-08-27  8:34   ` Cornelia Huck
2019-08-27  8:46     ` Cornelia Huck
2019-08-27 11:53     ` Vivek Goyal
2019-08-21 17:57 ` [PATCH 05/19] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
2019-08-27  8:39   ` Cornelia Huck
2019-08-27 11:54     ` Vivek Goyal
2019-08-21 17:57 ` [PATCH 06/19] fuse, dax: add fuse_conn->dax_dev field Vivek Goyal
2019-08-21 17:57 ` [PATCH 07/19] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
2019-08-21 17:57 ` [PATCH 08/19] fuse: Keep a list of free dax memory ranges Vivek Goyal
2019-08-21 17:57 ` [PATCH 09/19] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
2019-08-21 17:57 ` [PATCH 10/19] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
2019-08-21 17:57 ` [PATCH 11/19] fuse, dax: Implement dax read/write operations Vivek Goyal
2019-08-21 19:49   ` Liu Bo
2019-08-22 12:59     ` Vivek Goyal
2019-08-21 17:57 ` [PATCH 12/19] fuse, dax: add DAX mmap support Vivek Goyal
2019-08-21 17:57 ` [PATCH 13/19] fuse: Define dax address space operations Vivek Goyal
2019-08-21 17:57 ` [PATCH 14/19] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
2019-08-21 17:57 ` [PATCH 15/19] fuse: Maintain a list of busy elements Vivek Goyal
2019-08-21 17:57 ` [PATCH 16/19] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
2019-08-21 17:57 ` [PATCH 17/19] fuse: Add logic to free up a memory range Vivek Goyal
2019-08-21 17:57 ` [PATCH 18/19] fuse: Release file in process context Vivek Goyal
2019-08-21 17:57 ` [PATCH 19/19] fuse: Take inode lock for dax inode truncation Vivek Goyal
