linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink
@ 2020-08-07 13:13 Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink Shiyang Ruan
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto

This patchset is a try to resolve the problem of tracking shared page
for fsdax.

Instead of per-page tracking method, this patchset introduces a query
interface: get_shared_files(), which is implemented by each FS, to
obtain the owners of a shared page.  It returns an owner list of this
shared page.  Then, the memory-failure() iterates the list to be able
to notify each process using files that sharing this page.

The design of the tracking method is as follow:
1. dax_assocaite_entry() associates the owner's info to this page
- For non-reflink case:
  page->mapping,->index stores the file's mapping, offset in file.
    A dax page is not shared by other files. dax_associate_entry() is
    called only once.  So, use page->mapping,->index to store the
    owner's info.
- For reflink case:
  page->mapping,->index stores the block device, offset in device.
    A dax page is shared more than once.  So, dax_assocaite_entry()
    will be called more than once.  We introduce page->zone_device_data
    as reflink counter, to indicate that this page is shared and how
    many owners now is using this page. The page->mapping,->index is
    used to store the block_device of the fs and page offset of this
    device.

2. dax_lock_page() calls query interface to lock each dax entry
- For non-reflink case:
  owner's info is stored in page->mapping,->index.
    So, It is easy to lock its dax entry.
- For reflink case:
  owner's info is obtained by calling get_shared_files(), which is
  implemented by FS.
    The FS context could be found in block_device that stored by
    page->mapping.  Then lock the dax entries of the owners.

In memory-failure(), since the owner list has been obtained in 
dax_lock_page(), just iterate the list and handle the error.  This
patchset didn't handle the memory failure on metadata of FS because
I haven't found a way to distinguish whether this page contains
matadata yet.  Still working on it.

==
I also borrowed and made some changes on Goldwyn's patchsets.
These patches makes up for the lack of CoW mechanism in fsdax.

The rests are dax & reflink support for xfs.

(Rebased on v5.8)
==
Shiyang Ruan (8):
  fs: introduce get_shared_files() for dax&reflink
  fsdax, mm: track files sharing dax page for memory-failure
  fsdax: introduce dax_copy_edges() for COW
  fsdax: copy data before write
  fsdax: replace mmap entry in case of CoW
  fsdax: dedup file range to use a compare function
  fs/xfs: handle CoW for fsdax write() path
  fs/xfs: support dedupe for fsdax

 fs/btrfs/reflink.c     |   3 +-
 fs/dax.c               | 302 +++++++++++++++++++++++++++++++++++------
 fs/ocfs2/file.c        |   2 +-
 fs/read_write.c        |  11 +-
 fs/xfs/xfs_bmap_util.c |   6 +-
 fs/xfs/xfs_file.c      |  10 +-
 fs/xfs/xfs_iomap.c     |   3 +-
 fs/xfs/xfs_iops.c      |  11 +-
 fs/xfs/xfs_reflink.c   |  80 ++++++-----
 fs/xfs/xfs_super.c     |  67 +++++++++
 include/linux/dax.h    |  18 ++-
 include/linux/fs.h     |  11 +-
 include/linux/mm.h     |   8 ++
 mm/memory-failure.c    | 138 ++++++++++++-------
 14 files changed, 529 insertions(+), 141 deletions(-)

-- 
2.27.0




^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 16:15   ` Darrick J. Wong
  2020-08-07 13:13 ` [RFC PATCH 2/8] fsdax, mm: track files sharing dax page for memory-failure Shiyang Ruan
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto

Under the mode of both dax and reflink on, one page may be shared by
multiple files and offsets.  In order to track them in memory-failure or
other cases, we introduce this function by finding out who is sharing
this block(the page) in a filesystem.  It returns a list that contains
all the owners, and the offset in each owner.

For XFS, rmapbt is used to find out the owners of one block.  So, it
should be turned on when we want to use dax&reflink feature together.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/xfs/xfs_super.c  | 67 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |  7 +++++
 include/linux/fs.h  |  2 ++
 3 files changed, 76 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 379cbff438bc..b71392219c91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,9 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1097,6 +1100,69 @@ xfs_fs_free_cached_objects(
 	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
+static int _get_shared_files_fn(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct list_head	*list = priv;
+	struct xfs_inode	*ip;
+	struct shared_files	*sfp;
+
+	/* Get files that incore, filter out others that are not in use. */
+	xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE, 0, &ip);
+	if (ip && !ip->i_vnode.i_mapping)
+		return 0;
+
+	sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);
+	sfp->mapping = ip->i_vnode.i_mapping;
+	sfp->index = rec->rm_offset;
+	list_add_tail(&sfp->list, list);
+
+	return 0;
+}
+
+static int
+xfs_fs_get_shared_files(
+	struct super_block	*sb,
+	pgoff_t			offset,
+	struct list_head	*list)
+{
+	struct xfs_mount	*mp = XFS_M(sb);
+	struct xfs_trans	*tp = NULL;
+	struct xfs_btree_cur	*cur = NULL;
+	struct xfs_rmap_irec	rmap_low = { 0 }, rmap_high = { 0 };
+	struct xfs_buf		*agf_bp = NULL;
+	xfs_agblock_t		bno = XFS_B_TO_FSB(mp, offset);
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
+	int			error = 0;
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+	if (error)
+		return error;
+
+	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
+
+	memset(&cur->bc_rec, 0, sizeof(cur->bc_rec));
+	/* Construct the range for one rmap search */
+	memset(&rmap_low, 0, sizeof(rmap_low));
+	memset(&rmap_high, 0xFF, sizeof(rmap_high));
+	rmap_low.rm_startblock = rmap_high.rm_startblock = bno;
+
+	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+				     _get_shared_files_fn, list);
+	if (error == -ECANCELED)
+		error = 0;
+
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, agf_bp);
+	return error;
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1110,6 +1176,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.get_shared_files	= xfs_fs_get_shared_files,
 };
 
 static int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..0a85e321d6b4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -40,6 +40,13 @@ struct dax_operations {
 
 extern struct attribute_group dax_attribute_group;
 
+struct shared_files {
+	struct list_head	list;
+	struct address_space	*mapping;
+	pgoff_t			index;
+	dax_entry_t		cookie;
+};
+
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *dax_get_by_host(const char *host);
 struct dax_device *alloc_dax(void *private, const char *host,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5abba86107d..81de3d2739b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1977,6 +1977,8 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	int (*get_shared_files)(struct super_block *sb, pgoff_t offset,
+				struct list_head *list);
 };
 
 /*
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 2/8] fsdax, mm: track files sharing dax page for memory-failure
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 3/8] fsdax: introduce dax_copy_edges() for COW Shiyang Ruan
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto

When memory-failure occurs on a pmem device which contains a filesystem,
we need to find out which files are in using, and then notify processes
that are using these files to handle the error.

The design of the track method is as follow:
1. dax_assocaite_entry() associates the owner's info to this page
- For non-reflink case:
  page->mapping,->index stores the file's mapping, offset in file.
    A dax page is not shared by other files. dax_associate_entry() is
    called only once.  So, use page->mapping,->index to store the
    owner's info.
- For reflink case:
  page->mapping,->index stores the block device, offset in device.
    A dax page is shared more than once.  So, dax_assocaite_entry()
    will be called more than once.  We introduce page->zone_device_data
    as reflink counter, to indicate that this page is shared and how
    many owners now is using this page. The page->mapping,->index is
    used to store the block_device of the fs and page offset of this
    device.

2. dax_lock_page() calls query interface to lock each dax entry
- For non-reflink case:
  owner's info is stored in page->mapping,->index.
    So, It is easy to lock its dax entry.
- For reflink case:
  owner's info is obtained by calling get_shared_files(), which is
  implemented by FS.
    The FS context could be found in block_device that stored by
    page->mapping.  Then lock the dax entries of the owners.

In memory-failure(), since owners list has been obtained in
dax_lock_page(), just iterate the list and handle the error.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c            | 111 +++++++++++++++++++++++++++--------
 include/linux/dax.h |   6 +-
 include/linux/mm.h  |   8 +++
 mm/memory-failure.c | 138 +++++++++++++++++++++++++++-----------------
 4 files changed, 183 insertions(+), 80 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..47380f75ef38 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -329,7 +329,7 @@ static unsigned long dax_end_pfn(void *entry)
  * offsets.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address)
+		struct vm_fault *vmf, pgoff_t offset)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
 	int i = 0;
@@ -337,13 +337,27 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	index = linear_page_index(vma, address & ~(size - 1));
+	index = linear_page_index(vmf->vma, vmf->address & ~(size - 1));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(page->mapping);
-		page->mapping = mapping;
-		page->index = index + i++;
+		BUG_ON(!page->mapping && IS_FSDAX_SHARED(page));
+
+		/* Use zone_device_data as reflink counter here.
+		 * If one page is associated for the first time, then use the
+		 * ->mapping,->index as normal.  For the second time it is
+		 * assocated, we store the block_device that this page belongs
+		 * to in ->mapping and the offset within this block_device in
+		 * ->index, and increase the reflink counter.
+		 */
+		if (!page->mapping) {
+			page->mapping = mapping;
+			page->index = index + i++;
+		} else {
+			page->mapping = (struct address_space *)mapping->host->i_sb->s_bdev;
+			page->index = offset;
+			page->zone_device_data++;
+		}
 	}
 }
 
@@ -359,9 +373,12 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-		page->mapping = NULL;
-		page->index = 0;
+		if (!IS_FSDAX_SHARED(page)) {
+			page->mapping = NULL;
+			page->index = 0;
+		} else {
+			page->zone_device_data--;
+		}
 	}
 }
 
@@ -386,7 +403,7 @@ static struct page *dax_busy_page(void *entry)
  * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
  * not be locked.
  */
-dax_entry_t dax_lock_page(struct page *page)
+void _dax_lock_page(struct page *page, struct shared_files *sfp)
 {
 	XA_STATE(xas, NULL, 0);
 	void *entry;
@@ -394,7 +411,7 @@ dax_entry_t dax_lock_page(struct page *page)
 	/* Ensure page->mapping isn't freed while we look at it */
 	rcu_read_lock();
 	for (;;) {
-		struct address_space *mapping = READ_ONCE(page->mapping);
+		struct address_space *mapping = READ_ONCE(sfp->mapping);
 
 		entry = NULL;
 		if (!mapping || !dax_mapping(mapping))
@@ -413,11 +430,11 @@ dax_entry_t dax_lock_page(struct page *page)
 
 		xas.xa = &mapping->i_pages;
 		xas_lock_irq(&xas);
-		if (mapping != page->mapping) {
+		if (mapping != sfp->mapping) {
 			xas_unlock_irq(&xas);
 			continue;
 		}
-		xas_set(&xas, page->index);
+		xas_set(&xas, sfp->index);
 		entry = xas_load(&xas);
 		if (dax_is_locked(entry)) {
 			rcu_read_unlock();
@@ -430,18 +447,61 @@ dax_entry_t dax_lock_page(struct page *page)
 		break;
 	}
 	rcu_read_unlock();
-	return (dax_entry_t)entry;
+	sfp->cookie = (dax_entry_t)entry;
+}
+
+int dax_lock_page(struct page *page, struct list_head *shared_files)
+{
+	struct shared_files *sfp;
+
+	if (IS_FSDAX_SHARED(page)) {
+		struct block_device *bdev = (struct block_device *)page->mapping;
+
+		bdev->bd_super->s_op->get_shared_files(bdev->bd_super,
+						page->index, shared_files);
+		list_for_each_entry(sfp, shared_files, list) {
+			_dax_lock_page(page, sfp);
+			if (!sfp->cookie)
+				return 1;
+		}
+	} else {
+		sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);
+		sfp->mapping = page->mapping;
+		sfp->index = page->index;
+		list_add(&sfp->list, shared_files);
+		_dax_lock_page(page, sfp);
+		if (!sfp->cookie)
+			return 1;
+	}
+	return 0;
 }
 
-void dax_unlock_page(struct page *page, dax_entry_t cookie)
+void _dax_unlock_page(struct page *page, struct shared_files *sfp)
 {
-	struct address_space *mapping = page->mapping;
-	XA_STATE(xas, &mapping->i_pages, page->index);
+	struct address_space *mapping = sfp->mapping;
 
+	XA_STATE(xas, &mapping->i_pages, sfp->index);
 	if (S_ISCHR(mapping->host->i_mode))
 		return;
 
-	dax_unlock_entry(&xas, (void *)cookie);
+	dax_unlock_entry(&xas, (void *)sfp->cookie);
+}
+
+void dax_unlock_page(struct page *page, struct list_head *shared_files)
+{
+	struct shared_files *sfp, *next;
+
+	if (IS_FSDAX_SHARED(page)) {
+		list_for_each_entry_safe(sfp, next, shared_files, list) {
+			if (!sfp->cookie)
+				_dax_unlock_page(page, sfp);
+			kfree(sfp);
+		}
+	} else {
+		if (!sfp->cookie)
+			_dax_unlock_page(page, sfp);
+		kfree(sfp);
+	}
 }
 
 /*
@@ -715,7 +775,8 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+		pgoff_t bdoff, void *entry, pfn_t pfn, unsigned long flags,
+		bool dirty)
 {
 	void *new_entry = dax_make_entry(pfn, flags);
 
@@ -738,7 +799,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
+		dax_associate_entry(new_entry, mapping, vmf, bdoff);
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -1030,7 +1091,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
+	*entry = dax_insert_entry(xas, mapping, vmf, 0, *entry, pfn,
 			DAX_ZERO_PAGE, false);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
@@ -1337,8 +1398,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto error_finish_iomap;
 
-		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						 0, write && !sync);
+		entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
+					 pfn, 0, write && !sync);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
@@ -1418,7 +1479,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
+	*entry = dax_insert_entry(xas, mapping, vmf, 0, *entry, pfn,
 			DAX_PMD | DAX_ZERO_PAGE, false);
 
 	if (arch_needs_pgtable_deposit()) {
@@ -1555,8 +1616,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
-		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						DAX_PMD, write && !sync);
+		entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
+					 pfn, DAX_PMD, write && !sync);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0a85e321d6b4..d6d9f4b721bc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,7 +8,7 @@
 
 /* Flag for synchronous flush */
 #define DAXDEV_F_SYNC (1UL << 0)
-
+#define IS_FSDAX_SHARED(p) ((unsigned long)p->zone_device_data != 0)
 typedef unsigned long dax_entry_t;
 
 struct iomap_ops;
@@ -148,8 +148,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
-dax_entry_t dax_lock_page(struct page *page);
-void dax_unlock_page(struct page *page, dax_entry_t cookie);
+int dax_lock_page(struct page *page, struct list_head *shared_files);
+void dax_unlock_page(struct page *page, struct list_head *shared_files);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
 		int blocksize)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..bffccc34f1a1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1115,6 +1115,14 @@ static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_fsdax_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+		is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_FS_DAX;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b8ccb1fb9b..53cc65821c99 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -93,9 +93,13 @@ static int hwpoison_filter_dev(struct page *p)
 	if (PageSlab(p))
 		return -EINVAL;
 
-	mapping = page_mapping(p);
-	if (mapping == NULL || mapping->host == NULL)
-		return -EINVAL;
+	if (is_device_fsdax_page(p) & IS_FSDAX_SHARED(p))
+		dev = ((struct block_device *)p->mapping)->s_dev;
+	else {
+		mapping = page_mapping(p);
+		if (mapping == NULL || mapping->host == NULL)
+			return -EINVAL;
+	}
 
 	dev = mapping->host->i_sb->s_dev;
 	if (hwpoison_filter_dev_major != ~0U &&
@@ -263,9 +267,8 @@ void shake_page(struct page *p, int access)
 EXPORT_SYMBOL_GPL(shake_page);
 
 static unsigned long dev_pagemap_mapping_shift(struct page *page,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long address = vma_address(page, vma);
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -306,8 +309,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-		       struct vm_area_struct *vma,
-		       struct list_head *to_kill)
+		       struct address_space *mapping, pgoff_t offset,
+		       struct vm_area_struct *vma, struct list_head *to_kill)
 {
 	struct to_kill *tk;
 
@@ -317,12 +320,18 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-	else
-		tk->size_shift = page_shift(compound_head(p));
-
+	if (is_device_fsdax_page(p)) {
+		tk->addr = vma->vm_start +
+				((offset - vma->vm_pgoff) << PAGE_SHIFT);
+		tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
+	} else {
+		tk->addr = page_address_in_vma(p, vma);
+		if (is_zone_device_page(p)) {
+			tk->size_shift = dev_pagemap_mapping_shift(p, vma,
+							vma_address(p, vma));
+		} else
+			tk->size_shift = page_shift(compound_head(p));
+	}
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
 	 * "tk->size_shift" is always non-zero for !is_zone_device_page(),
@@ -468,33 +477,26 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 			if (!page_mapped_in_vma(page, vma))
 				continue;
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, NULL, 0, vma, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
 	page_unlock_anon_vma_read(av);
 }
 
-/*
- * Collect processes when the error hit a file mapped page.
- */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-				int force_early)
+static void collect_procs_file_one(struct page *page, struct list_head *to_kill,
+		struct address_space *mapping, pgoff_t pgoff, bool force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct address_space *mapping = page->mapping;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page_to_pgoff(page);
 		struct task_struct *t = task_early_kill(tsk, force_early);
-
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
-				      pgoff) {
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 			/*
 			 * Send early kill signal to tasks where a vma covers
 			 * the page but the corrupted page is not necessarily
@@ -502,19 +504,39 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			 * Assume applications who requested early kill want
 			 * to be informed of all such data corruptions.
 			 */
-			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+			if (vma->vm_mm == t->mm) {
+				add_to_kill(t, page, mapping, pgoff, vma,
+					    to_kill);
+			}
 		}
 	}
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
 }
 
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+		struct list_head *shared_files, int force_early)
+{
+	struct shared_files *sfp;
+
+	if (shared_files)
+		list_for_each_entry(sfp, shared_files, list) {
+			collect_procs_file_one(page, to_kill, sfp->mapping,
+					       sfp->index, force_early);
+		}
+	else
+		collect_procs_file_one(page, to_kill, page->mapping,
+				       page_to_pgoff(page), force_early);
+}
+
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
 static void collect_procs(struct page *page, struct list_head *tokill,
-				int force_early)
+		struct list_head *shared_files, int force_early)
 {
 	if (!page->mapping)
 		return;
@@ -522,7 +544,33 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 	if (PageAnon(page))
 		collect_procs_anon(page, tokill, force_early);
 	else
-		collect_procs_file(page, tokill, force_early);
+		collect_procs_file(page, tokill, shared_files, force_early);
+}
+
+static void unmap_mappings_range(struct list_head *to_kill,
+		struct list_head *shared_files)
+{
+	unsigned long size = 0;
+	loff_t start = 0;
+	struct to_kill *tk;
+	struct shared_files *sfp;
+
+	list_for_each_entry(tk, to_kill, nd)
+		if (tk->size_shift)
+			size = max(size, 1UL << tk->size_shift);
+	if (size) {
+		/*
+		 * Unmap the largest mapping to avoid breaking up
+		 * device-dax mappings which are constant size. The
+		 * actual size of the mapping being torn down is
+		 * communicated in siginfo, see kill_proc()
+		 */
+		list_for_each_entry(sfp, shared_files, list) {
+			start = (sfp->index << PAGE_SHIFT) & ~(size - 1);
+			unmap_mapping_range(sfp->mapping, start,
+					    start + size, 0);
+		}
+	}
 }
 
 static const char *action_name[] = {
@@ -1026,7 +1074,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
+		collect_procs(hpage, &tokill, NULL, flags & MF_ACTION_REQUIRED);
 
 	if (!PageHuge(hpage)) {
 		unmap_success = try_to_unmap(hpage, ttu);
@@ -1181,12 +1229,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 {
 	struct page *page = pfn_to_page(pfn);
 	const bool unmap_success = true;
-	unsigned long size = 0;
-	struct to_kill *tk;
 	LIST_HEAD(tokill);
+	LIST_HEAD(shared_files);
 	int rc = -EBUSY;
-	loff_t start;
-	dax_entry_t cookie;
+	int error = 0;
 
 	/*
 	 * Prevent the inode from being freed while we are interrogating
@@ -1195,9 +1241,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	 * also prevents changes to the mapping of this pfn until
 	 * poison signaling is complete.
 	 */
-	cookie = dax_lock_page(page);
-	if (!cookie)
-		goto out;
+	error = dax_lock_page(page, &shared_files);
+	if (error)
+		goto unlock;
 
 	if (hwpoison_filter(page)) {
 		rc = 0;
@@ -1225,26 +1271,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	 * SIGBUS (i.e. MF_MUST_KILL)
 	 */
 	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
-	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+	collect_procs(page, &tokill, &shared_files, flags & MF_ACTION_REQUIRED);
+
+	unmap_mappings_range(&tokill, &shared_files);
 
-	list_for_each_entry(tk, &tokill, nd)
-		if (tk->size_shift)
-			size = max(size, 1UL << tk->size_shift);
-	if (size) {
-		/*
-		 * Unmap the largest mapping to avoid breaking up
-		 * device-dax mappings which are constant size. The
-		 * actual size of the mapping being torn down is
-		 * communicated in siginfo, see kill_proc()
-		 */
-		start = (page->index << PAGE_SHIFT) & ~(size - 1);
-		unmap_mapping_range(page->mapping, start, start + size, 0);
-	}
 	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
 	rc = 0;
 unlock:
-	dax_unlock_page(page, cookie);
-out:
+	dax_unlock_page(page, &shared_files);
 	/* drop pgmap ref acquired in caller */
 	put_dev_pagemap(pgmap);
 	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 3/8] fsdax: introduce dax_copy_edges() for COW
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 2/8] fsdax, mm: track files sharing dax page for memory-failure Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 4/8] fsdax: copy data before write Shiyang Ruan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto, Goldwyn Rodrigues

Add address output in dax_iomap_pfn() in order to perform a memcpy() in CoW
case.  Since this function both output address and pfn, rename it to
dax_iomap_direct_access().

dax_copy_edges() is a helper functions performs a copy from one part of
the device to another for data not page aligned.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 47380f75ef38..308678c58d4d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1043,8 +1043,8 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-			 pfn_t *pfnp)
+static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
+			 pfn_t *pfnp, void **addr)
 {
 	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
@@ -1055,12 +1055,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 	if (rc)
 		return rc;
 	id = dax_read_lock();
-	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), addr,
+				   pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
 	}
+	if (!pfnp)
+		goto out_check_addr;
 	rc = -EINVAL;
 	if (PFN_PHYS(length) < size)
 		goto out;
@@ -1070,11 +1072,59 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 	if (length > 1 && !pfn_t_devmap(*pfnp))
 		goto out;
 	rc = 0;
+
+out_check_addr:
+	if (!addr)
+		goto out;
+	if (!*addr)
+		rc = -EFAULT;
 out:
 	dax_read_unlock(id);
 	return rc;
 }
 
+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ *		    the write, but required for CoW because
+ *		    offset/offset+length are not page aligned.
+ */
+static int dax_copy_edges(loff_t pos, loff_t length, struct iomap *srcmap,
+			  void *daddr, bool pmd)
+{
+	size_t page_size = pmd ? PMD_SIZE : PAGE_SIZE;
+	loff_t offset = pos & (page_size - 1);
+	size_t size = ALIGN(offset + length, page_size);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, page_size);
+	void *saddr = 0;
+	int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, NULL, &saddr);
+	if (ret)
+		return ret;
+	/*
+	 * Copy the first part of the page
+	 * Note: we pass offset as length
+	 */
+	if (offset) {
+		if (saddr)
+			ret = memcpy_mcsafe(daddr, saddr, offset);
+		else
+			memset(daddr, 0, offset);
+	}
+
+	/* Copy the last part of the range */
+	if (end < pg_end) {
+		if (saddr)
+			ret = memcpy_mcsafe(daddr + offset + length,
+			       saddr + offset + length,	pg_end - end);
+		else
+			memset(daddr + offset + length, 0,
+					pg_end - end);
+	}
+	return ret;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1394,7 +1444,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
-		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+		error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE, &pfn,
+						NULL);
 		if (error < 0)
 			goto error_finish_iomap;
 
@@ -1612,7 +1663,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
-		error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
+		error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE, &pfn,
+						NULL);
 		if (error < 0)
 			goto finish_iomap;
 
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 4/8] fsdax: copy data before write
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
                   ` (2 preceding siblings ...)
  2020-08-07 13:13 ` [RFC PATCH 3/8] fsdax: introduce dax_copy_edges() for COW Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 5/8] fsdax: replace mmap entry in case of CoW Shiyang Ruan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto

Add dax_copy_edges() into each dax actor functions to perform CoW.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 308678c58d4d..65553e3f7602 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1208,7 +1208,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+			!(iomap->flags & IOMAP_F_SHARED)))
 		return -EIO;
 
 	/*
@@ -1247,6 +1248,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
+		if (iomap != srcmap) {
+			ret = dax_copy_edges(pos, length, srcmap, kaddr, false);
+			if (ret)
+				break;
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
@@ -1358,6 +1365,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	vm_fault_t ret = 0;
 	void *entry;
 	pfn_t pfn;
+	void *kaddr;
 
 	trace_dax_pte_fault(inode, vmf, ret);
 	/*
@@ -1439,19 +1447,27 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
+cow:
 		if (iomap.flags & IOMAP_F_NEW) {
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
 		error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE, &pfn,
-						NULL);
+						&kaddr);
 		if (error < 0)
 			goto error_finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
 					 pfn, 0, write && !sync);
 
+		if (srcmap.type != IOMAP_HOLE) {
+			error = dax_copy_edges(pos, PAGE_SIZE, &srcmap, kaddr,
+					       false);
+			if (error)
+				goto error_finish_iomap;
+		}
+
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
 		 * we can insert PTE into page tables only after that happens.
@@ -1475,12 +1491,15 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 		goto finish_iomap;
 	case IOMAP_UNWRITTEN:
+		if (write && iomap.flags & IOMAP_F_SHARED)
+			goto cow;
+		fallthrough;
 	case IOMAP_HOLE:
 		if (!write) {
 			ret = dax_load_hole(&xas, mapping, &entry, vmf);
 			goto finish_iomap;
 		}
-		/*FALLTHRU*/
+		fallthrough;
 	default:
 		WARN_ON_ONCE(1);
 		error = -EIO;
@@ -1582,6 +1601,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	loff_t pos;
 	int error;
 	pfn_t pfn;
+	void *kaddr;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1663,14 +1683,22 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
+cow:
 		error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE, &pfn,
-						NULL);
+						&kaddr);
 		if (error < 0)
 			goto finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, iomap.addr, entry,
 					 pfn, DAX_PMD, write && !sync);
 
+		if (srcmap.type != IOMAP_HOLE) {
+			error = dax_copy_edges(pos, PMD_SIZE, &srcmap, kaddr,
+					       true);
+			if (error)
+				goto unlock_entry;
+		}
+
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
 		 * we can insert PMD into page tables only after that happens.
@@ -1689,6 +1717,9 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		result = vmf_insert_pfn_pmd(vmf, pfn, write);
 		break;
 	case IOMAP_UNWRITTEN:
+		if (write && iomap.flags & IOMAP_F_SHARED)
+			goto cow;
+		fallthrough;
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
 			break;
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 5/8] fsdax: replace mmap entry in case of CoW
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
                   ` (3 preceding siblings ...)
  2020-08-07 13:13 ` [RFC PATCH 4/8] fsdax: copy data before write Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 6/8] fsdax: dedup file range to use a compare function Shiyang Ruan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto, Goldwyn Rodrigues

We replace the existing entry to the newly allocated one
in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected. This
helps us snapshots so new write pagefaults after snapshots
trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 65553e3f7602..28b8e23b11ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -766,6 +766,9 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 	return 0;
 }
 
+#define DAX_IF_DIRTY		(1ULL << 0)
+#define DAX_IF_COW		(1ULL << 1)
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -776,14 +779,16 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 static void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
 		pgoff_t bdoff, void *entry, pfn_t pfn, unsigned long flags,
-		bool dirty)
+		bool insert_flags)
 {
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = insert_flags & DAX_IF_DIRTY;
+	bool cow = insert_flags & DAX_IF_COW;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -795,7 +800,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
@@ -819,6 +824,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1366,6 +1374,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	void *entry;
 	pfn_t pfn;
 	void *kaddr;
+	unsigned long insert_flags = 0;
 
 	trace_dax_pte_fault(inode, vmf, ret);
 	/*
@@ -1491,8 +1500,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 		goto finish_iomap;
 	case IOMAP_UNWRITTEN:
-		if (write && iomap.flags & IOMAP_F_SHARED)
+		if (write && (iomap.flags & IOMAP_F_SHARED)) {
+			insert_flags |= DAX_IF_COW;
 			goto cow;
+		}
 		fallthrough;
 	case IOMAP_HOLE:
 		if (!write) {
@@ -1602,6 +1613,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	int error;
 	pfn_t pfn;
 	void *kaddr;
+	unsigned long insert_flags = 0;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1717,14 +1729,17 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		result = vmf_insert_pfn_pmd(vmf, pfn, write);
 		break;
 	case IOMAP_UNWRITTEN:
-		if (write && iomap.flags & IOMAP_F_SHARED)
+		if (write && (iomap.flags & IOMAP_F_SHARED)) {
+			insert_flags |= DAX_IF_COW;
 			goto cow;
+		}
 		fallthrough;
 	case IOMAP_HOLE:
-		if (WARN_ON_ONCE(write))
+		if (!write) {
+			result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
 			break;
-		result = dax_pmd_load_hole(&xas, vmf, &iomap, &entry);
-		break;
+		}
+		fallthrough;
 	default:
 		WARN_ON_ONCE(1);
 		break;
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 6/8] fsdax: dedup file range to use a compare function
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
                   ` (4 preceding siblings ...)
  2020-08-07 13:13 ` [RFC PATCH 5/8] fsdax: replace mmap entry in case of CoW Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path Shiyang Ruan
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto, Goldwyn Rodrigues

With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.

This may not be the best way to solve this. Suggestions welcome.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/btrfs/reflink.c   |  3 +-
 fs/dax.c             | 67 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/file.c      |  2 +-
 fs/read_write.c      | 11 ++++----
 fs/xfs/xfs_reflink.c |  2 +-
 include/linux/dax.h  |  5 ++++
 include/linux/fs.h   |  9 +++++-
 7 files changed, 90 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 040009d1cc31..d0412bc338da 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -765,7 +765,8 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 		return ret;
 
 	return generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-					    len, remap_flags);
+					    len, remap_flags,
+					    vfs_dedupe_file_range_compare);
 }
 
 loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
diff --git a/fs/dax.c b/fs/dax.c
index 28b8e23b11ac..80a9946fd25a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/fs_dax.h>
 
+#define MIN(a, b) (((a) < (b)) ? (a) : (b))
+
 static inline unsigned int pe_order(enum page_entry_size pe_size)
 {
 	if (pe_size == PE_SIZE_PTE)
@@ -1874,3 +1876,68 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+		loff_t destoff, loff_t len, bool *is_same,
+		const struct iomap_ops *ops)
+{
+	void *saddr, *daddr;
+	struct iomap smap = { 0 };
+	struct iomap dmap = { 0 };
+	bool same = false;
+	loff_t cmp_len;
+	int id, ret = 0;
+
+	id = dax_read_lock();
+	while (len) {
+		ret = ops->iomap_begin(src, srcoff, len, 0, &smap, NULL);
+		if (ret < 0)
+			goto out_src;
+		cmp_len = MIN(len, smap.offset + smap.length - srcoff);
+
+		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &dmap, NULL);
+		if (ret < 0)
+			goto out_dest;
+		cmp_len = MIN(cmp_len, dmap.offset + dmap.length - destoff);
+
+		if (smap.type == IOMAP_HOLE && dmap.type == IOMAP_HOLE)
+			goto next;
+
+		if (smap.type == IOMAP_HOLE || dmap.type == IOMAP_HOLE) {
+			same = false;
+			goto next;
+		}
+
+		ret = dax_iomap_direct_access(&smap, srcoff,
+			  ALIGN(srcoff + cmp_len, PAGE_SIZE), NULL, &saddr);
+		if (ret < 0)
+			goto out_dest;
+
+		ret = dax_iomap_direct_access(&dmap, destoff,
+			  ALIGN(destoff + cmp_len, PAGE_SIZE), NULL, &daddr);
+		if (ret < 0)
+			goto out_dest;
+
+		same = !memcmp(saddr, daddr, cmp_len);
+		if (!same)
+			break;
+next:
+		len -= cmp_len;
+		srcoff += cmp_len;
+		destoff += cmp_len;
+out_dest:
+		if (ops->iomap_end)
+			ops->iomap_end(dest, destoff, cmp_len, 0, 0, &dmap);
+out_src:
+		if (ops->iomap_end)
+			ops->iomap_end(src, srcoff, len, 0, 0, &smap);
+
+		if (ret < 0)
+			goto out;
+	}
+	*is_same = same;
+out:
+	dax_read_unlock(id);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 85979e2214b3..9d101f129d16 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2591,7 +2591,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			&len, remap_flags);
+			&len, remap_flags, vfs_dedupe_file_range_compare);
 	if (ret < 0 || len == 0)
 		goto out_unlock;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index 4fb797822567..2974a624f232 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1927,7 +1927,7 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
 					 loff_t len, bool *is_same)
 {
@@ -2008,6 +2008,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 out_error:
 	return error;
 }
+EXPORT_SYMBOL_GPL(vfs_dedupe_file_range_compare);
 
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
@@ -2019,7 +2020,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+				  loff_t *len, unsigned int remap_flags,
+				  compare_range_t compare)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -2078,9 +2080,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	 */
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
-
-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+		ret = (*compare)(inode_in, pos_in,
+			inode_out, pos_out, *len, &is_same);
 		if (ret)
 			return ret;
 		if (!is_same)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 107bf2a2f344..792217cd1e64 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1338,7 +1338,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+			len, remap_flags, vfs_dedupe_file_range_compare);
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d6d9f4b721bc..607e21933928 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -223,6 +223,11 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
 			struct iomap *iomap);
+int dax_file_range_compare(struct inode *src, loff_t srcoff,
+			   struct inode *dest, loff_t destoff,
+			   loff_t len, bool *is_same,
+			   const struct iomap_ops *ops);
+
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 81de3d2739b9..c2d985295db5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1924,13 +1924,20 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+					 struct inode *dest, loff_t destoff,
+					 loff_t len, bool *is_same);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+			       struct inode *dest, loff_t destpos,
+			       loff_t len, bool *is_same);
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				       struct file *file_out, loff_t pos_out,
 				       size_t len, unsigned int flags);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
-					 unsigned int remap_flags);
+					 unsigned int remap_flags,
+					 compare_range_t cmp);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
                   ` (5 preceding siblings ...)
  2020-08-07 13:13 ` [RFC PATCH 6/8] fsdax: dedup file range to use a compare function Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:13 ` [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax Shiyang Ruan
  2020-08-07 13:38 ` [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Matthew Wilcox
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto

In fsdax mode, WRITE and ZERO on a shared extent need CoW mechanism
performed.  After CoW, new extents needs to be remapped to the file.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/xfs/xfs_bmap_util.c |  6 +++++-
 fs/xfs/xfs_file.c      | 10 +++++++---
 fs/xfs/xfs_iomap.c     |  3 ++-
 fs/xfs/xfs_iops.c      | 11 ++++++++---
 fs/xfs/xfs_reflink.c   |  2 ++
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f37f5cc4b19f..5d09d6c454b6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -969,10 +969,14 @@ xfs_free_file_space(
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
 	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+		  IS_DAX(VFS_I(ip)) ?
+		  &xfs_direct_write_iomap_ops : &xfs_buffered_write_iomap_ops);
 	if (error)
 		return error;
 
+	if (xfs_is_reflink_inode(ip))
+		xfs_reflink_end_cow(ip, offset, len);
+
 	/*
 	 * If we zeroed right up to EOF and EOF straddles a page boundary we
 	 * must make sure that the post-EOF area is also zeroed because the
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..45041913129b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -588,9 +588,13 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
-	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
-		i_size_write(inode, iocb->ki_pos);
-		error = xfs_setfilesize(ip, pos, ret);
+	if (ret > 0) {
+		if (iocb->ki_pos > i_size_read(inode)) {
+			i_size_write(inode, iocb->ki_pos);
+			error = xfs_setfilesize(ip, pos, ret);
+		}
+		if (xfs_is_cow_inode(ip))
+			xfs_reflink_end_cow(ip, pos, ret);
 	}
 out:
 	xfs_iunlock(ip, iolock);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index b9a8c3798e08..a1fc75f11cf9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -748,13 +748,14 @@ xfs_direct_write_iomap_begin(
 		goto out_unlock;
 
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
+		bool need_convert = flags & IOMAP_DIRECT || IS_DAX(inode);
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
 			goto out_unlock;
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode, need_convert);
 		if (error)
 			goto out_unlock;
 		if (shared)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 80a13c8561d8..6dd6a973ea75 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -860,6 +860,7 @@ xfs_setattr_size(
 	int			error;
 	uint			lock_flags = 0;
 	bool			did_zeroing = false;
+	const struct iomap_ops	*ops;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -906,13 +907,17 @@ xfs_setattr_size(
 	 * extension, or zeroing out the rest of the block on a downward
 	 * truncate.
 	 */
+	if (IS_DAX(inode))
+		ops = &xfs_direct_write_iomap_ops;
+	else
+		ops = &xfs_buffered_write_iomap_ops;
+
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+				&did_zeroing, ops);
 	} else {
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = iomap_truncate_page(inode, newsize, &did_zeroing, ops);
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 792217cd1e64..f87ab78dd421 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1269,6 +1269,8 @@ xfs_reflink_zero_posteof(
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
 	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+			IS_DAX(VFS_I(ip)) ?
+			&xfs_direct_write_iomap_ops :
 			&xfs_buffered_write_iomap_ops);
 }
 
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
                   ` (6 preceding siblings ...)
  2020-08-07 13:13 ` [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path Shiyang Ruan
@ 2020-08-07 13:13 ` Shiyang Ruan
  2020-08-07 13:38 ` [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Matthew Wilcox
  8 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-08-07 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david,
	hch, rgoldwyn, qi.fuli, y-goto

Use xfs_break_layouts() to break files' layouts when locking them.  And
call dax_file_range_compare() function to compare range for files both
have DAX flag.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/xfs/xfs_reflink.c | 78 ++++++++++++++++++++++++++------------------
 1 file changed, 46 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f87ab78dd421..b2901ad1a269 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -29,6 +29,7 @@
 #include "xfs_iomap.h"
 #include "xfs_sb.h"
 #include "xfs_ag_resv.h"
+#include <linux/dax.h>
 
 /*
  * Copy on Write of Shared Blocks
@@ -1185,47 +1186,41 @@ xfs_reflink_remap_blocks(
  * back out both locks.
  */
 static int
-xfs_iolock_two_inodes_and_break_layout(
-	struct inode		*src,
-	struct inode		*dest)
+xfs_reflink_remap_lock_and_break_layouts(
+	struct file		*file_in,
+	struct file		*file_out)
 {
 	int			error;
+	struct inode		*inode_in = file_inode(file_in);
+	struct xfs_inode	*src = XFS_I(inode_in);
+	struct inode		*inode_out = file_inode(file_out);
+	struct xfs_inode	*dest = XFS_I(inode_out);
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
-	if (src > dest)
+	if (inode_in > inode_out) {
+		swap(inode_in, inode_out);
 		swap(src, dest);
-
-retry:
-	/* Wait to break both inodes' layouts before we start locking. */
-	error = break_layout(src, true);
-	if (error)
-		return error;
-	if (src != dest) {
-		error = break_layout(dest, true);
-		if (error)
-			return error;
 	}
 
-	/* Lock one inode and make sure nobody got in and leased it. */
-	inode_lock(src);
-	error = break_layout(src, false);
+	inode_lock(inode_in);
+	xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_layouts(inode_in, &iolock, BREAK_UNMAP);
+	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
 	if (error) {
-		inode_unlock(src);
-		if (error == -EWOULDBLOCK)
-			goto retry;
+		inode_unlock(inode_in);
 		return error;
 	}
 
-	if (src == dest)
+	if (inode_in == inode_out)
 		return 0;
 
-	/* Lock the other inode and make sure nobody got in and leased it. */
-	inode_lock_nested(dest, I_MUTEX_NONDIR2);
-	error = break_layout(dest, false);
+	inode_lock_nested(inode_out, I_MUTEX_NONDIR2);
+	xfs_ilock(dest, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_layouts(inode_out, &iolock, BREAK_UNMAP);
+	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
 	if (error) {
-		inode_unlock(src);
-		inode_unlock(dest);
-		if (error == -EWOULDBLOCK)
-			goto retry;
+		inode_unlock(inode_in);
+		inode_unlock(inode_out);
 		return error;
 	}
 
@@ -1244,6 +1239,11 @@ xfs_reflink_remap_unlock(
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
 
+	if (inode_in > inode_out) {
+		swap(inode_in, inode_out);
+		swap(src, dest);
+	}
+
 	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
 	if (!same_inode)
 		xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
@@ -1274,6 +1274,14 @@ xfs_reflink_zero_posteof(
 			&xfs_buffered_write_iomap_ops);
 }
 
+int xfs_reflink_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+					  struct inode *dest, loff_t destoff,
+					  loff_t len, bool *is_same)
+{
+	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same,
+				      &xfs_read_iomap_ops);
+}
+
 /*
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file will
@@ -1318,9 +1326,10 @@ xfs_reflink_remap_prep(
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
 	ssize_t			ret;
+	compare_range_t		cmp;
 
 	/* Lock both files against IO */
-	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+	ret = xfs_reflink_remap_lock_and_break_layouts(file_in, file_out);
 	if (ret)
 		return ret;
 	if (same_inode)
@@ -1335,12 +1344,17 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;
 
-	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	/* Don't share DAX file data with non-DAX file. */
+	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;
 
+	if (IS_DAX(inode_in))
+		cmp = xfs_reflink_dedupe_file_range_compare;
+	else
+		cmp = vfs_dedupe_file_range_compare;
+
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags, vfs_dedupe_file_range_compare);
+			len, remap_flags, cmp);
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
-- 
2.27.0




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink
  2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
                   ` (7 preceding siblings ...)
  2020-08-07 13:13 ` [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax Shiyang Ruan
@ 2020-08-07 13:38 ` Matthew Wilcox
  2020-08-10  8:15   ` Ruan Shiyang
  8 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2020-08-07 13:38 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	darrick.wong, dan.j.williams, david, hch, rgoldwyn, qi.fuli,
	y-goto

On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the problem of tracking shared page
> for fsdax.
> 
> Instead of per-page tracking method, this patchset introduces a query
> interface: get_shared_files(), which is implemented by each FS, to
> obtain the owners of a shared page.  It returns an owner list of this
> shared page.  Then, the memory-failure() iterates the list to be able
> to notify each process using files that sharing this page.
> 
> The design of the tracking method is as follow:
> 1. dax_assocaite_entry() associates the owner's info to this page

I think that's the first problem with this design.  dax_associate_entry is
a horrendous idea which needs to be ripped out, not made more important.
It's all part of the general problem of trying to do something on a
per-page basis instead of per-extent basis.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink
  2020-08-07 13:13 ` [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink Shiyang Ruan
@ 2020-08-07 16:15   ` Darrick J. Wong
  2020-08-10  8:07     ` Ruan Shiyang
  0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2020-08-07 16:15 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, rgoldwyn, qi.fuli, y-goto

On Fri, Aug 07, 2020 at 09:13:29PM +0800, Shiyang Ruan wrote:
> Under the mode of both dax and reflink on, one page may be shared by
> multiple files and offsets.  In order to track them in memory-failure or
> other cases, we introduce this function by finding out who is sharing
> this block(the page) in a filesystem.  It returns a list that contains
> all the owners, and the offset in each owner.
> 
> For XFS, rmapbt is used to find out the owners of one block.  So, it
> should be turned on when we want to use dax&reflink feature together.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> ---
>  fs/xfs/xfs_super.c  | 67 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h |  7 +++++
>  include/linux/fs.h  |  2 ++
>  3 files changed, 76 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 379cbff438bc..b71392219c91 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,9 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -1097,6 +1100,69 @@ xfs_fs_free_cached_objects(
>  	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int _get_shared_files_fn(

Needs an xfs_ prefix...

> +	struct xfs_btree_cur	*cur,
> +	struct xfs_rmap_irec	*rec,
> +	void			*priv)
> +{
> +	struct list_head	*list = priv;
> +	struct xfs_inode	*ip;
> +	struct shared_files	*sfp;
> +
> +	/* Get files that incore, filter out others that are not in use. */
> +	xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE, 0, &ip);

No error checking at all?

What if rm_owner refers to metadata?

> +	if (ip && !ip->i_vnode.i_mapping)
> +		return 0;

When is the xfs_inode released?  We don't iput it here, and there's no
way for dax_unlock_page (afaict the only consumer) to do it, so we
leak the reference.

> +
> +	sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);

If there are millions of open files reflinked to this range of pmem this
is going to allocate a /lot/ of memory.

> +	sfp->mapping = ip->i_vnode.i_mapping;

sfp->mapping = VFS_I(ip)->i_mapping;

> +	sfp->index = rec->rm_offset;
> +	list_add_tail(&sfp->list, list);

Why do we leave ->cookie uninitialized?  What does it even do?

> +
> +	return 0;
> +}
> +
> +static int
> +xfs_fs_get_shared_files(
> +	struct super_block	*sb,
> +	pgoff_t			offset,

Which device does this offset refer to?  XFS supports multiple storage
devices.

Also, uh, is this really a pgoff_t?  If yes, you can't use it with
XFS_B_TO_FSB below without first converting it to a loff_t.

> +	struct list_head	*list)
> +{
> +	struct xfs_mount	*mp = XFS_M(sb);
> +	struct xfs_trans	*tp = NULL;
> +	struct xfs_btree_cur	*cur = NULL;
> +	struct xfs_rmap_irec	rmap_low = { 0 }, rmap_high = { 0 };

No need to memset(0) rmap_low later, or zero rmap_high just to memset it
later.

> +	struct xfs_buf		*agf_bp = NULL;
> +	xfs_agblock_t		bno = XFS_B_TO_FSB(mp, offset);

"FSB" refers to xfs_fsblock_t.  You just ripped the upper 32 bits off
the fsblock number.

> +	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
> +	int			error = 0;
> +
> +	error = xfs_trans_alloc_empty(mp, &tp);
> +	if (error)
> +		return error;
> +
> +	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
> +
> +	memset(&cur->bc_rec, 0, sizeof(cur->bc_rec));

Not necessary, bc_rec is zero in a freshly created cursor.

> +	/* Construct the range for one rmap search */
> +	memset(&rmap_low, 0, sizeof(rmap_low));
> +	memset(&rmap_high, 0xFF, sizeof(rmap_high));
> +	rmap_low.rm_startblock = rmap_high.rm_startblock = bno;
> +
> +	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
> +				     _get_shared_files_fn, list);
> +	if (error == -ECANCELED)
> +		error = 0;
> +
> +	xfs_btree_del_cursor(cur, error);
> +	xfs_trans_brelse(tp, agf_bp);
> +	return error;
> +}

Looking at this, I don't think this is the right way to approach memory
poisoning.  Rather than allocating a (potentially huge) linked list and
passing it to the memory poison code to unmap pages, kill processes, and
free the list, I think:

1) "->get_shared_files" should be more targetted.  Call it ->storage_lost
or something, so that it only has one purpose, which is to react to
asynchronous notifications that storage has been lost.

2) The inner _get_shared_files_fn should directly call back into the
memory manager to remove a poisoned page from the mapping and signal
whatever process might have it mapped.

That way, _get_shared_files_fn can look in the xfs buffer cache to see
if we have a copy in DRAM, and immediately write it back to pmem.

Hmm and now that you've gotten me rambling about hwpoison, I wonder what
happens if dram backing part of the xfs buffer cache goes bad...

--D

> +
>  static const struct super_operations xfs_super_operations = {
>  	.alloc_inode		= xfs_fs_alloc_inode,
>  	.destroy_inode		= xfs_fs_destroy_inode,
> @@ -1110,6 +1176,7 @@ static const struct super_operations xfs_super_operations = {
>  	.show_options		= xfs_fs_show_options,
>  	.nr_cached_objects	= xfs_fs_nr_cached_objects,
>  	.free_cached_objects	= xfs_fs_free_cached_objects,
> +	.get_shared_files	= xfs_fs_get_shared_files,
>  };
>  
>  static int
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 6904d4e0b2e0..0a85e321d6b4 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -40,6 +40,13 @@ struct dax_operations {
>  
>  extern struct attribute_group dax_attribute_group;
>  
> +struct shared_files {
> +	struct list_head	list;
> +	struct address_space	*mapping;
> +	pgoff_t			index;
> +	dax_entry_t		cookie;
> +};
> +
>  #if IS_ENABLED(CONFIG_DAX)
>  struct dax_device *dax_get_by_host(const char *host);
>  struct dax_device *alloc_dax(void *private, const char *host,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f5abba86107d..81de3d2739b9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1977,6 +1977,8 @@ struct super_operations {
>  				  struct shrink_control *);
>  	long (*free_cached_objects)(struct super_block *,
>  				    struct shrink_control *);
> +	int (*get_shared_files)(struct super_block *sb, pgoff_t offset,
> +				struct list_head *list);
>  };
>  
>  /*
> -- 
> 2.27.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink
  2020-08-07 16:15   ` Darrick J. Wong
@ 2020-08-10  8:07     ` Ruan Shiyang
  0 siblings, 0 replies; 15+ messages in thread
From: Ruan Shiyang @ 2020-08-10  8:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, rgoldwyn, qi.fuli, y-goto



On 2020/8/8 上午12:15, Darrick J. Wong wrote:
> On Fri, Aug 07, 2020 at 09:13:29PM +0800, Shiyang Ruan wrote:
>> Under the mode of both dax and reflink on, one page may be shared by
>> multiple files and offsets.  In order to track them in memory-failure or
>> other cases, we introduce this function by finding out who is sharing
>> this block(the page) in a filesystem.  It returns a list that contains
>> all the owners, and the offset in each owner.
>>
>> For XFS, rmapbt is used to find out the owners of one block.  So, it
>> should be turned on when we want to use dax&reflink feature together.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
>> ---
>>   fs/xfs/xfs_super.c  | 67 +++++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/dax.h |  7 +++++
>>   include/linux/fs.h  |  2 ++
>>   3 files changed, 76 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 379cbff438bc..b71392219c91 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -35,6 +35,9 @@
>>   #include "xfs_refcount_item.h"
>>   #include "xfs_bmap_item.h"
>>   #include "xfs_reflink.h"
>> +#include "xfs_alloc.h"
>> +#include "xfs_rmap.h"
>> +#include "xfs_rmap_btree.h"
>>   
>>   #include <linux/magic.h>
>>   #include <linux/fs_context.h>
>> @@ -1097,6 +1100,69 @@ xfs_fs_free_cached_objects(
>>   	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>>   }
>>   
>> +static int _get_shared_files_fn(
> 
> Needs an xfs_ prefix...
> 
>> +	struct xfs_btree_cur	*cur,
>> +	struct xfs_rmap_irec	*rec,
>> +	void			*priv)
>> +{
>> +	struct list_head	*list = priv;
>> +	struct xfs_inode	*ip;
>> +	struct shared_files	*sfp;
>> +
>> +	/* Get files that incore, filter out others that are not in use. */
>> +	xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE, 0, &ip);
> 
> No error checking at all?
> 
> What if rm_owner refers to metadata?

Yes, we need to check whether the page contains metadata.  I remembered 
that.  I wrote the check code but removed it in this patch, because I 
didn't find a way to associate this block device with the dax page that 
contains metadata.  We can call dax_associate_entry() to create the 
association if the page's owner is a file, but it's not work for 
metadata.  I should have explained here.

> 
>> +	if (ip && !ip->i_vnode.i_mapping)
>> +		return 0;
> 
> When is the xfs_inode released?  We don't iput it here, and there's no
> way for dax_unlock_page (afaict the only consumer) to do it, so we
> leak the reference.

Do you mean xfs_irele() ?  Sorry, I didn't realize that.  All we need is 
get the ->mapping form a given inode number.  So, I think we can call 
xfs_irele() when exiting this function.

> 
>> +
>> +	sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);
> 
> If there are millions of open files reflinked to this range of pmem this
> is going to allocate a /lot/ of memory.
> 
>> +	sfp->mapping = ip->i_vnode.i_mapping;
> 
> sfp->mapping = VFS_I(ip)->i_mapping;
> 
>> +	sfp->index = rec->rm_offset;
>> +	list_add_tail(&sfp->list, list);
> 
> Why do we leave ->cookie uninitialized?  What does it even do?

It's my fault.  ->cookie should have been added in the next patch.  It 
stores each owner's dax entry when calling dax_lock_page() in 
memory-failure.

> 
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +xfs_fs_get_shared_files(
>> +	struct super_block	*sb,
>> +	pgoff_t			offset,
> 
> Which device does this offset refer to?  XFS supports multiple storage
> devices.
> 
> Also, uh, is this really a pgoff_t?  If yes, you can't use it with
> XFS_B_TO_FSB below without first converting it to a loff_t.

The offset here is assigned as iomap->addr, which is obtained from 
iomap_begin().  So that we can easily looking up for the owners of this 
offset.

I don't quite understand what you said about supporting multiple storage 
devices.  What are these devices?  Do you mean NVDIMM, HDD, SSD and others?

> 
>> +	struct list_head	*list)
>> +{
>> +	struct xfs_mount	*mp = XFS_M(sb);
>> +	struct xfs_trans	*tp = NULL;
>> +	struct xfs_btree_cur	*cur = NULL;
>> +	struct xfs_rmap_irec	rmap_low = { 0 }, rmap_high = { 0 };
> 
> No need to memset(0) rmap_low later, or zero rmap_high just to memset it
> later.
> 
>> +	struct xfs_buf		*agf_bp = NULL;
>> +	xfs_agblock_t		bno = XFS_B_TO_FSB(mp, offset);
> 
> "FSB" refers to xfs_fsblock_t.  You just ripped the upper 32 bits off
> the fsblock number.

I think I misused these types.  Sorry for that.
> 
>> +	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
>> +	int			error = 0;
>> +
>> +	error = xfs_trans_alloc_empty(mp, &tp);
>> +	if (error)
>> +		return error;
>> +
>> +	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
>> +	if (error)
>> +		return error;
>> +
>> +	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
>> +
>> +	memset(&cur->bc_rec, 0, sizeof(cur->bc_rec));
> 
> Not necessary, bc_rec is zero in a freshly created cursor.
> 
>> +	/* Construct the range for one rmap search */
>> +	memset(&rmap_low, 0, sizeof(rmap_low));
>> +	memset(&rmap_high, 0xFF, sizeof(rmap_high));
>> +	rmap_low.rm_startblock = rmap_high.rm_startblock = bno;
>> +
>> +	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
>> +				     _get_shared_files_fn, list);
>> +	if (error == -ECANCELED)
>> +		error = 0;
>> +
>> +	xfs_btree_del_cursor(cur, error);
>> +	xfs_trans_brelse(tp, agf_bp);
>> +	return error;
>> +}
> 
> Looking at this, I don't think this is the right way to approach memory
> poisoning.  Rather than allocating a (potentially huge) linked list and
> passing it to the memory poison code to unmap pages, kill processes, and
> free the list, I think:
> 
> 1) "->get_shared_files" should be more targetted.  Call it ->storage_lost
> or something, so that it only has one purpose, which is to react to
> asynchronous notifications that storage has been lost.

Yes, it's better.  I only considered file tracking.
> 
> 2) The inner _get_shared_files_fn should directly call back into the
> memory manager to remove a poisoned page from the mapping and signal
> whatever process might have it mapped.

For the error handling part, it's really reasonable.  But we have to 
call dax_lock_page() at the beginning of memory-failure, which also need 
to iterate all the owners in order to lock their dax entries.  I think 
it not supposed to call the lock in each iteration instead of at the 
beginning.

> 
> That way, _get_shared_files_fn can look in the xfs buffer cache to see
> if we have a copy in DRAM, and immediately write it back to pmem.

I didn't think of fixing the storage device.  Will take it into 
consideration.

> 
> Hmm and now that you've gotten me rambling about hwpoison, I wonder what
> happens if dram backing part of the xfs buffer cache goes bad...

Yes, so many possible situations to consider.  For the current stage, 
just shutdown the filesystem if memory failures on metadata, and kill 
user processes if failures on normal files.  Is that OK?


Anyway, thanks for reviewing.

--
Thanks,
Ruan Shiyang.

> 
> --D
> 
>> +
>>   static const struct super_operations xfs_super_operations = {
>>   	.alloc_inode		= xfs_fs_alloc_inode,
>>   	.destroy_inode		= xfs_fs_destroy_inode,
>> @@ -1110,6 +1176,7 @@ static const struct super_operations xfs_super_operations = {
>>   	.show_options		= xfs_fs_show_options,
>>   	.nr_cached_objects	= xfs_fs_nr_cached_objects,
>>   	.free_cached_objects	= xfs_fs_free_cached_objects,
>> +	.get_shared_files	= xfs_fs_get_shared_files,
>>   };
>>   
>>   static int
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 6904d4e0b2e0..0a85e321d6b4 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -40,6 +40,13 @@ struct dax_operations {
>>   
>>   extern struct attribute_group dax_attribute_group;
>>   
>> +struct shared_files {
>> +	struct list_head	list;
>> +	struct address_space	*mapping;
>> +	pgoff_t			index;
>> +	dax_entry_t		cookie;
>> +};
>> +
>>   #if IS_ENABLED(CONFIG_DAX)
>>   struct dax_device *dax_get_by_host(const char *host);
>>   struct dax_device *alloc_dax(void *private, const char *host,
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index f5abba86107d..81de3d2739b9 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1977,6 +1977,8 @@ struct super_operations {
>>   				  struct shrink_control *);
>>   	long (*free_cached_objects)(struct super_block *,
>>   				    struct shrink_control *);
>> +	int (*get_shared_files)(struct super_block *sb, pgoff_t offset,
>> +				struct list_head *list);
>>   };
>>   
>>   /*
>> -- 
>> 2.27.0
>>
>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink
  2020-08-07 13:38 ` [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Matthew Wilcox
@ 2020-08-10  8:15   ` Ruan Shiyang
  2020-08-10 11:16     ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Ruan Shiyang @ 2020-08-10  8:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	darrick.wong, dan.j.williams, david, hch, rgoldwyn, qi.fuli,
	y-goto



On 2020/8/7 下午9:38, Matthew Wilcox wrote:
> On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
>> This patchset is a try to resolve the problem of tracking shared page
>> for fsdax.
>>
>> Instead of per-page tracking method, this patchset introduces a query
>> interface: get_shared_files(), which is implemented by each FS, to
>> obtain the owners of a shared page.  It returns an owner list of this
>> shared page.  Then, the memory-failure() iterates the list to be able
>> to notify each process using files that sharing this page.
>>
>> The design of the tracking method is as follow:
>> 1. dax_assocaite_entry() associates the owner's info to this page
> 
> I think that's the first problem with this design.  dax_associate_entry is
> a horrendous idea which needs to be ripped out, not made more important.
> It's all part of the general problem of trying to do something on a
> per-page basis instead of per-extent basis.
> 

The memory-failure needs to track owners info from a dax page, so I 
should associate the owner with this page.  In this version, I associate 
the block device to the dax page, so that the memory-failure is able to 
iterate the owners by the query interface provided by filesystem.


--
Thanks,
Ruan Shiyang.
> 
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink
  2020-08-10  8:15   ` Ruan Shiyang
@ 2020-08-10 11:16     ` Matthew Wilcox
  2020-08-10 21:10       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2020-08-10 11:16 UTC (permalink / raw)
  To: Ruan Shiyang
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	darrick.wong, dan.j.williams, david, hch, rgoldwyn, qi.fuli,
	y-goto

On Mon, Aug 10, 2020 at 04:15:50PM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/8/7 下午9:38, Matthew Wilcox wrote:
> > On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> > > This patchset is a try to resolve the problem of tracking shared page
> > > for fsdax.
> > > 
> > > Instead of per-page tracking method, this patchset introduces a query
> > > interface: get_shared_files(), which is implemented by each FS, to
> > > obtain the owners of a shared page.  It returns an owner list of this
> > > shared page.  Then, the memory-failure() iterates the list to be able
> > > to notify each process using files that sharing this page.
> > > 
> > > The design of the tracking method is as follow:
> > > 1. dax_assocaite_entry() associates the owner's info to this page
> > 
> > I think that's the first problem with this design.  dax_associate_entry is
> > a horrendous idea which needs to be ripped out, not made more important.
> > It's all part of the general problem of trying to do something on a
> > per-page basis instead of per-extent basis.
> > 
> 
> The memory-failure needs to track owners info from a dax page, so I should
> associate the owner with this page.  In this version, I associate the block
> device to the dax page, so that the memory-failure is able to iterate the
> owners by the query interface provided by filesystem.

No, it doesn't need to track owner info from a DAX page.  What it needs to
do is ask the filesystem.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink
  2020-08-10 11:16     ` Matthew Wilcox
@ 2020-08-10 21:10       ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2020-08-10 21:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ruan Shiyang, linux-kernel, linux-xfs, linux-nvdimm, linux-mm,
	linux-fsdevel, darrick.wong, dan.j.williams, hch, rgoldwyn,
	qi.fuli, y-goto

On Mon, Aug 10, 2020 at 12:16:57PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 10, 2020 at 04:15:50PM +0800, Ruan Shiyang wrote:
> > 
> > 
> > On 2020/8/7 下午9:38, Matthew Wilcox wrote:
> > > On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> > > > This patchset is a try to resolve the problem of tracking shared page
> > > > for fsdax.
> > > > 
> > > > Instead of per-page tracking method, this patchset introduces a query
> > > > interface: get_shared_files(), which is implemented by each FS, to
> > > > obtain the owners of a shared page.  It returns an owner list of this
> > > > shared page.  Then, the memory-failure() iterates the list to be able
> > > > to notify each process using files that sharing this page.
> > > > 
> > > > The design of the tracking method is as follow:
> > > > 1. dax_assocaite_entry() associates the owner's info to this page
> > > 
> > > I think that's the first problem with this design.  dax_associate_entry is
> > > a horrendous idea which needs to be ripped out, not made more important.
> > > It's all part of the general problem of trying to do something on a
> > > per-page basis instead of per-extent basis.
> > > 
> > 
> > The memory-failure needs to track owners info from a dax page, so I should
> > associate the owner with this page.  In this version, I associate the block
> > device to the dax page, so that the memory-failure is able to iterate the
> > owners by the query interface provided by filesystem.
> 
> No, it doesn't need to track owner info from a DAX page.  What it needs to
> do is ask the filesystem.

Just to add to this: the owner tracking that is current done deep
inside the DAX code needs to be moved out to the owner of the dax
device. That may be the dax device itself, or it may be a filesystem
like ext4 or XFS. Initially, nothing will be able to share pages and
page owner tracking should be done on the page itself as the DAX
code currently does.

Hence when a page error occurs, the device owner is called with the
page and error information, and the dax device or filesystem can
then look up the page owner (via mapping/index pointers) and run the
Die Userspace Die functions instead of running this all internally
in the DAX code.

Once the implementation is abstracted and controlled by the device
owner, then we can start working to change the XFS implementation to
support shared pages....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-08-10 21:10 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 13:13 [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink Shiyang Ruan
2020-08-07 16:15   ` Darrick J. Wong
2020-08-10  8:07     ` Ruan Shiyang
2020-08-07 13:13 ` [RFC PATCH 2/8] fsdax, mm: track files sharing dax page for memory-failure Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 3/8] fsdax: introduce dax_copy_edges() for COW Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 4/8] fsdax: copy data before write Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 5/8] fsdax: replace mmap entry in case of CoW Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 6/8] fsdax: dedup file range to use a compare function Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path Shiyang Ruan
2020-08-07 13:13 ` [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax Shiyang Ruan
2020-08-07 13:38 ` [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink Matthew Wilcox
2020-08-10  8:15   ` Ruan Shiyang
2020-08-10 11:16     ` Matthew Wilcox
2020-08-10 21:10       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).