* [PATCH 0/6] ocfs2: wire up {clone,copy,dedupe}_range
@ 2016-11-09 22:51 ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

Hi all,

These patches wire up the existing ocfs2 reflinking capabilities to
the new(ish) VFS {copy,clone,dedupe}_range interface.  The first few
patches clean up some minor bugs that I found; the last kernel patch
contains the new code.

A few minor fixes to xfstests are needed to make more of the tests
run.  I'll tack that patch on the end.

--D

[1] https://github.com/djwong/linux/tree/ocfs2-vfs-reflink

* [PATCH 1/6] ocfs2: convert inode refcount test to a helper
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 22:51   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

Replace the open-coded inode refcount flag test with a helper function
to reduce the potential for bugs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |   28 +++++++++++++++-------------
 fs/ocfs2/refcounttree.h |    2 ++
 2 files changed, 17 insertions(+), 13 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 1923851..59be8f4 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -48,6 +48,12 @@
 #include <linux/mount.h>
 #include <linux/posix_acl.h>
 
+/* Does this inode have the reflink flag set? */
+bool ocfs2_is_refcount_inode(struct inode *inode)
+{
+	return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+}
+
 struct ocfs2_cow_context {
 	struct inode *inode;
 	u32 cow_start;
@@ -410,7 +416,7 @@ static int ocfs2_get_refcount_block(struct inode *inode, u64 *ref_blkno)
 		goto out;
 	}
 
-	BUG_ON(!(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	di = (struct ocfs2_dinode *)di_bh->b_data;
 	*ref_blkno = le64_to_cpu(di->i_refcount_loc);
@@ -570,7 +576,7 @@ static int ocfs2_create_refcount_tree(struct inode *inode,
 	u32 num_got;
 	u64 suballoc_loc, first_blkno;
 
-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+	BUG_ON(ocfs2_is_refcount_inode(inode));
 
 	trace_ocfs2_create_refcount_tree(
 		(unsigned long long)OCFS2_I(inode)->ip_blkno);
@@ -708,7 +714,7 @@ static int ocfs2_set_refcount_tree(struct inode *inode,
 	struct ocfs2_refcount_block *rb;
 	struct ocfs2_refcount_tree *ref_tree;
 
-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+	BUG_ON(ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
 				       &ref_tree, &ref_root_bh);
@@ -775,7 +781,7 @@ int ocfs2_remove_refcount_tree(struct inode *inode, struct buffer_head *di_bh)
 	u64 blk = 0, bg_blkno = 0, ref_blkno = le64_to_cpu(di->i_refcount_loc);
 	u16 bit = 0;
 
-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL))
+	if (!ocfs2_is_refcount_inode(inode))
 		return 0;
 
 	BUG_ON(!ref_blkno);
@@ -2299,11 +2305,10 @@ int ocfs2_decrease_refcount(struct inode *inode,
 {
 	int ret;
 	u64 ref_blkno;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *tree;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_get_refcount_block(inode, &ref_blkno);
 	if (ret) {
@@ -2533,7 +2538,6 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
 					  int *ref_blocks)
 {
 	int ret;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *tree;
 	u64 start_cpos = ocfs2_blocks_to_clusters(inode->i_sb, phys_blkno);
@@ -2544,7 +2548,7 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
 		goto out;
 	}
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_get_refcount_tree(OCFS2_SB(inode->i_sb),
 				      refcount_loc, &tree);
@@ -3412,14 +3416,13 @@ static int ocfs2_refcount_cow_hunk(struct inode *inode,
 {
 	int ret;
 	u32 cow_start = 0, cow_len = 0;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *ref_tree;
 	struct ocfs2_cow_context *context = NULL;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_refcount_cal_cow_clusters(inode, &di->id2.i_list,
 					      cpos, write_len, max_cpos,
@@ -3629,11 +3632,10 @@ int ocfs2_refcount_cow_xattr(struct inode *inode,
 {
 	int ret;
 	struct ocfs2_xattr_value_root *xv = vb->vb_xv;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_cow_context *context = NULL;
 	u32 cow_start, cow_len;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_refcount_cal_cow_clusters(inode, &xv->xr_list,
 					      cpos, write_len, UINT_MAX,
@@ -3807,7 +3809,7 @@ static int ocfs2_attach_refcount_tree(struct inode *inode,
 
 	ocfs2_init_dealloc_ctxt(&dealloc);
 
-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL)) {
+	if (!ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_create_refcount_tree(inode, di_bh);
 		if (ret) {
 			mlog_errno(ret);
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 6422bbc..553edfb 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -17,6 +17,8 @@
 #ifndef OCFS2_REFCOUNTTREE_H
 #define OCFS2_REFCOUNTTREE_H
 
+bool ocfs2_is_refcount_inode(struct inode *inode);
+
 struct ocfs2_refcount_tree {
 	struct rb_node rf_node;
 	u64 rf_blkno;
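
As a rough userspace illustration of the helper pattern in this patch (not the kernel code itself): the `demo_` names, the struct, and the flag value below are stand-ins invented for the example. The sketch shows why returning `bool` from a masked test is safe: any nonzero masked value converts to `true`, whichever bit the flag occupies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag value, chosen only for this illustration. */
#define DEMO_HAS_REFCOUNT_FL	0x0010u

/* Hypothetical stand-in for the per-inode ocfs2 state. */
struct demo_inode_info {
	uint16_t ip_dyn_features;
};

/* Does this (demo) inode have the refcount flag set? */
static bool demo_is_refcount_inode(const struct demo_inode_info *oi)
{
	/* Conversion to bool maps any nonzero masked value to true. */
	return oi->ip_dyn_features & DEMO_HAS_REFCOUNT_FL;
}
```

Callers can then write `if (!demo_is_refcount_inode(oi))` without repeating the mask, which is the readability gain the patch is after.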


* [PATCH 2/6] ocfs2: add newlines to some error messages
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 22:51   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

These two error messages are missing the trailing newline.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/alloc.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index f72712f..bb2d207 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -5194,7 +5194,7 @@ int ocfs2_change_extent_flag(handle_t *handle,
 	rec = &el->l_recs[index];
 	if (new_flags && (rec->e_flags & new_flags)) {
 		mlog(ML_ERROR, "Owner %llu tried to set %d flags on an "
-		     "extent that already had them",
+		     "extent that already had them\n",
 		     (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci),
 		     new_flags);
 		goto out;
@@ -5202,7 +5202,7 @@ int ocfs2_change_extent_flag(handle_t *handle,
 
 	if (clear_flags && !(rec->e_flags & clear_flags)) {
 		mlog(ML_ERROR, "Owner %llu tried to clear %d flags on an "
-		     "extent that didn't have them",
+		     "extent that didn't have them\n",
 		     (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci),
 		     clear_flags);
 		goto out;


* [PATCH 3/6] ocfs2: prohibit refcounted swapfiles
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 22:51   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

The swapfile mechanism calls bmap once to find all the swap file
mappings, which means that we cannot properly support CoW remapping.
Therefore, error out if the swap code tries to call bmap on a
refcounted file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/aops.c |    9 +++++++++
 1 file changed, 9 insertions(+)


diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index c5c5b97..4d037db 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -464,6 +464,15 @@ static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block)
 	trace_ocfs2_bmap((unsigned long long)OCFS2_I(inode)->ip_blkno,
 			 (unsigned long long)block);
 
+	/*
+	 * The swap code (ab-)uses ->bmap to get a block mapping and then
+	 * bypasses the file system for actual I/O.  We really can't allow
+	 * that on refcounted inodes, so we have to skip out here.  And yes,
+	 * 0 is the magic code for a bmap error..
+	 */
+	if (ocfs2_is_refcount_inode(inode))
+		return 0;
+
 	/* We don't need to lock journal system files, since they aren't
 	 * accessed concurrently from multiple nodes.
 	 */
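
To make the "0 is the error code" convention concrete, here is a small userspace model. Everything in it is hypothetical (`demo_file`, `demo_bmap`, `demo_activate_swap`); it assumes that the caller maps every block once up front and treats a zero mapping as a hole, which is my understanding of how the generic swap activation path behaves.

```c
#include <errno.h>
#include <stdbool.h>

typedef unsigned long long demo_sector_t;

/* Hypothetical file model: one contiguous run of blocks on disk. */
struct demo_file {
	bool refcounted;
	demo_sector_t first_block;	/* physical block of logical block 0 */
	demo_sector_t nr_blocks;
};

/* Model of ->bmap: returns the physical block, or 0 for "no mapping". */
static demo_sector_t demo_bmap(const struct demo_file *f, demo_sector_t block)
{
	if (f->refcounted)
		return 0;	/* refuse mappings that CoW could later change */
	if (block >= f->nr_blocks)
		return 0;
	return f->first_block + block;
}

/* Model of a swapon-style scan that maps every block once, up front. */
static int demo_activate_swap(const struct demo_file *f)
{
	demo_sector_t i;

	for (i = 0; i < f->nr_blocks; i++) {
		if (demo_bmap(f, i) == 0)
			return -EINVAL;	/* a zero mapping is treated as a hole */
	}
	return 0;
}
```

In this model a refcounted file can never be activated as swap, because the very first lookup already reports "no mapping".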


* [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 22:51   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

When we're adding the refcount flag to an extent, we have to budget
enough space to handle a full extent btree split in addition to
whatever modifications have to be made to the refcount btree.  We
don't currently do this, with the result that generic/186 crashes
when we need an extent split but not a refcount split because meta_ac
never gets allocated.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 59be8f4..d92b6c6 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -3698,6 +3698,9 @@ int ocfs2_add_refcount_flag(struct inode *inode,
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_alloc_context *meta_ac = NULL;
 
+	/* We need to be able to handle at least an extent tree split. */
+	ref_blocks = ocfs2_extend_meta_needed(data_et->et_root_el);
+
 	ret = ocfs2_calc_refcount_meta_credits(inode->i_sb,
 					       ref_ci, ref_root_bh,
 					       p_cluster, num_clusters,
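
The failure mode described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the `demo_` names are hypothetical, the reservation is assumed to be skipped when the budget is zero, and the refcount credit calculation is assumed to add to the running count (as the patch implies by seeding `ref_blocks` first).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ocfs2_alloc_context. */
struct demo_alloc_ctx {
	int reserved_blocks;
};

/* Reserve metadata blocks only when the budget is nonzero. */
static struct demo_alloc_ctx *demo_reserve_meta(int blocks)
{
	struct demo_alloc_ctx *ac;

	if (blocks <= 0)
		return NULL;
	ac = malloc(sizeof(*ac));
	if (ac)
		ac->reserved_blocks = blocks;
	return ac;
}

/*
 * Compute the metadata budget.  extent_split_blocks models the value
 * the patch seeds ref_blocks with; refcount_blocks models what the
 * refcount credit calculation adds on top.
 */
static int demo_meta_budget(int extent_split_blocks, int refcount_blocks,
			    int seed_with_extent_split)
{
	int ref_blocks = seed_with_extent_split ? extent_split_blocks : 0;

	ref_blocks += refcount_blocks;
	return ref_blocks;
}
```

With `seed_with_extent_split` set to 0 and no refcount blocks needed, the budget is zero and no context is reserved, so a later extent split has nothing to allocate from. Seeding the budget closes that gap.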


* [PATCH 5/6] ocfs2: don't eat io errors during _dio_end_io_write
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 22:51   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

ocfs2_dio_end_io_write eats whatever errors may happen,
which means that write errors do not propagate to userspace.
Fix that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/aops.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)


diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 4d037db..136a49c 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2263,10 +2263,10 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock,
 	return ret;
 }
 
-static void ocfs2_dio_end_io_write(struct inode *inode,
-				   struct ocfs2_dio_write_ctxt *dwc,
-				   loff_t offset,
-				   ssize_t bytes)
+static int ocfs2_dio_end_io_write(struct inode *inode,
+				  struct ocfs2_dio_write_ctxt *dwc,
+				  loff_t offset,
+				  ssize_t bytes)
 {
 	struct ocfs2_cached_dealloc_ctxt dealloc;
 	struct ocfs2_extent_tree et;
@@ -2374,6 +2374,8 @@ static void ocfs2_dio_end_io_write(struct inode *inode,
 	if (locked)
 		inode_unlock(inode);
 	ocfs2_dio_free_write_ctx(inode, dwc);
+
+	return ret;
 }
 
 /*
@@ -2388,6 +2390,7 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	int level;
+	int ret = 0;
 
 	if (bytes <= 0)
 		return 0;
@@ -2396,13 +2399,13 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
 	if (private)
-		ocfs2_dio_end_io_write(inode, private, offset, bytes);
+		ret = ocfs2_dio_end_io_write(inode, private, offset, bytes);
 
 	ocfs2_iocb_clear_rw_locked(iocb);
 
 	level = ocfs2_iocb_rw_locked_level(iocb);
 	ocfs2_rw_unlock(inode, level);
-	return 0;
+	return ret;
 }
 
 static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
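
The shape of this fix, reduced to a userspace sketch: the completion helper returns its status, and the caller carries that status past the unconditional cleanup. The `demo_` functions and the `err` parameter (which stands in for a journal or I/O failure) are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical write-completion step; err models a failure inside it. */
static int demo_end_io_write(int err)
{
	int ret = err;

	/* Cleanup would run here regardless of ret. */
	return ret;
}

/* Hypothetical end_io handler that no longer swallows the error. */
static int demo_end_io(ssize_t bytes, bool is_write, int err)
{
	int ret = 0;

	if (bytes <= 0)
		return 0;

	if (is_write)
		ret = demo_end_io_write(err);

	/* Unlocking would happen here on both success and failure. */
	return ret;
}
```

The key property is that cleanup stays unconditional while the first error still reaches the caller.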


* [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 22:51   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2.  Compared to the existing
ocfs2 reflink ioctl, we have to do things a little differently to support
the VFS semantics (we can clone subranges of a file but we don't clone
xattrs), but the VFS ioctls are more broadly supported.
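
For readers unfamiliar with the VFS interface, here is a userspace sketch of how a caller might exercise the clone path.  `demo_clone_or_copy` is a hypothetical helper, not part of this patch; it assumes Linux headers that provide `FICLONERANGE` and `struct file_clone_range`, and it falls back to a plain byte copy on filesystems that cannot share extents.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

/*
 * Share or copy len bytes from fd_in@off_in to fd_out@off_out.
 * Returns 1 if extents were shared, 0 if bytes were copied (or len
 * was 0), and -1 on error with errno set.
 */
static int demo_clone_or_copy(int fd_in, off_t off_in,
			      int fd_out, off_t off_out, size_t len)
{
	struct file_clone_range fcr = {
		.src_fd = fd_in,
		.src_offset = (uint64_t)off_in,
		.src_length = (uint64_t)len,
		.dest_offset = (uint64_t)off_out,
	};
	char buf[65536];

	/* A zero src_length would mean "to end of file"; avoid that here. */
	if (len == 0)
		return 0;

	/* First try to share extents; any failure falls through to a copy. */
	if (ioctl(fd_out, FICLONERANGE, &fcr) == 0)
		return 1;

	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t got = pread(fd_in, buf, want, off_in);
		ssize_t done = 0;

		if (got < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (got == 0)
			break;	/* source ended before len bytes */

		while (done < got) {
			ssize_t w = pwrite(fd_out, buf + done,
					   (size_t)(got - done),
					   off_out + done);
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			done += w;
		}
		off_in += got;
		off_out += got;
		len -= (size_t)got;
	}
	return 0;
}
```

On a filesystem with reflink support the ioctl is expected to succeed for block-aligned ranges (or ranges ending at EOF); elsewhere the helper degrades to an ordinary copy, so callers see the same file contents either way.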

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/file.c         |   62 ++++-
 fs/ocfs2/file.h         |    3 
 fs/ocfs2/refcounttree.c |  619 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/refcounttree.h |    7 +
 4 files changed, 688 insertions(+), 3 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 000c234..d5a022d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
 	*done = ret;
 }
 
-static int ocfs2_remove_inode_range(struct inode *inode,
-				    struct buffer_head *di_bh, u64 byte_start,
-				    u64 byte_len)
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len)
 {
 	int ret = 0, flags = 0, done = 0, i;
 	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
@@ -2440,6 +2440,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static ssize_t ocfs2_file_copy_range(struct file *file_in,
+				     loff_t pos_in,
+				     struct file *file_out,
+				     loff_t pos_out,
+				     size_t len,
+				     unsigned int flags)
+{
+	int error;
+
+	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					  len, false);
+	if (error)
+		return error;
+	return len;
+}
+
+static int ocfs2_file_clone_range(struct file *file_in,
+				  loff_t pos_in,
+				  struct file *file_out,
+				  loff_t pos_out,
+				  u64 len)
+{
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					 len, false);
+}
+
+#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
+				       u64 loff,
+				       u64 len,
+				       struct file *dst_file,
+				       u64 dst_loff)
+{
+	int error;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > OCFS2_MAX_DEDUPE_LEN)
+		len = OCFS2_MAX_DEDUPE_LEN;
+
+	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
+					  len, true);
+	if (error)
+		return error;
+	return len;
+}
+
 const struct inode_operations ocfs2_file_iops = {
 	.setattr	= ocfs2_setattr,
 	.getattr	= ocfs2_getattr,
@@ -2479,6 +2529,9 @@ const struct file_operations ocfs2_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops = {
@@ -2524,6 +2577,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops_no_plocks = {
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index e8c62f2..897fd9a 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 
 int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 				   size_t count);
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len);
 #endif /* OCFS2_FILE_H */
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index d92b6c6..3e2198c 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -34,6 +34,7 @@
 #include "xattr.h"
 #include "namei.h"
 #include "ocfs2_trace.h"
+#include "file.h"
 
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -4447,3 +4448,621 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 
 	return error;
 }
+
+/* Update destination inode size, if necessary. */
+static int ocfs2_reflink_update_dest(struct inode *dest,
+				     struct buffer_head *d_bh,
+				     loff_t newlen)
+{
+	handle_t *handle;
+	struct ocfs2_dinode *di = (struct ocfs2_dinode *)d_bh->b_data;
+	int ret;
+
+	if (newlen <= i_size_read(dest))
+		return 0;
+
+	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
+				   OCFS2_INODE_UPDATE_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		mlog_errno(ret);
+		return ret;
+	}
+
+	ret = ocfs2_journal_access_di(handle, INODE_CACHE(dest), d_bh,
+				      OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_commit;
+	}
+
+	spin_lock(&OCFS2_I(dest)->ip_lock);
+	if (newlen > i_size_read(dest)) {
+		i_size_write(dest, newlen);
+		di->i_size = cpu_to_le64(newlen);
+	}
+	spin_unlock(&OCFS2_I(dest)->ip_lock);
+
+	ocfs2_journal_dirty(handle, d_bh);
+
+out_commit:
+	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
+	return ret;
+}
+
+/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
+static int ocfs2_reflink_remap_extent(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len,
+				      struct ocfs2_cached_dealloc_ctxt *dealloc)
+{
+	struct ocfs2_extent_tree s_et;
+	struct ocfs2_extent_tree t_et;
+	struct ocfs2_dinode *dis;
+	struct buffer_head *ref_root_bh = NULL;
+	struct ocfs2_refcount_tree *ref_tree;
+	struct ocfs2_super *osb;
+	loff_t pstart, plen;
+	u32 p_cluster, num_clusters, slast, spos, tpos;
+	unsigned int ext_flags;
+	int ret = 0;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
+	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
+
+	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
+	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
+	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
+
+	while (spos < slast) {
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
+
+		/* Look up the extent. */
+		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
+					 &num_clusters, &ext_flags);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		num_clusters = min_t(u32, num_clusters, slast - spos);
+
+		/* Punch out the dest range. */
+		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
+		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
+		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (p_cluster == 0)
+			goto next_loop;
+
+		/* Lock the refcount btree... */
+		ret = ocfs2_lock_refcount_tree(osb,
+					       le64_to_cpu(dis->i_refcount_loc),
+					       1, &ref_tree, &ref_root_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		/* Mark s_inode's extent as refcounted. */
+		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
+			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
+						      &ref_tree->rf_ci,
+						      ref_root_bh, spos,
+						      p_cluster, num_clusters,
+						      dealloc, NULL);
+			if (ret) {
+				mlog_errno(ret);
+				goto out_unlock_refcount;
+			}
+		}
+
+		/* Map in the new extent. */
+		ext_flags |= OCFS2_EXT_REFCOUNTED;
+		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
+						  &ref_tree->rf_ci,
+						  ref_root_bh,
+						  tpos, p_cluster,
+						  num_clusters,
+						  ext_flags,
+						  dealloc);
+		if (ret) {
+			mlog_errno(ret);
+			goto out_unlock_refcount;
+		}
+
+		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+		brelse(ref_root_bh);
+next_loop:
+		spos += num_clusters;
+		tpos += num_clusters;
+	}
+
+out:
+	return ret;
+out_unlock_refcount:
+	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+	brelse(ref_root_bh);
+	return ret;
+}
+
+/* Set up refcount tree and remap s_inode to t_inode. */
+static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len)
+{
+	struct ocfs2_cached_dealloc_ctxt dealloc;
+	struct ocfs2_super *osb;
+	struct ocfs2_dinode *dis;
+	struct ocfs2_dinode *dit;
+	int ret;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	dit = (struct ocfs2_dinode *)t_bh->b_data;
+	ocfs2_init_dealloc_ctxt(&dealloc);
+
+	/*
+	 * If both inodes belong to two different refcount groups then
+	 * forget it because we don't know how (or want) to go merging
+	 * refcount trees.
+	 */
+	ret = -EOPNOTSUPP;
+	if (ocfs2_is_refcount_inode(s_inode) &&
+	    ocfs2_is_refcount_inode(t_inode) &&
+	    le64_to_cpu(dis->i_refcount_loc) !=
+	    le64_to_cpu(dit->i_refcount_loc))
+		goto out;
+
+	/* Neither inode has a refcount tree.  Add one to s_inode. */
+	if (!ocfs2_is_refcount_inode(s_inode) &&
+	    !ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Ensure that both inodes end up with the same refcount tree. */
+	if (!ocfs2_is_refcount_inode(s_inode)) {
+		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
+					      le64_to_cpu(dit->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+	if (!ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
+					      le64_to_cpu(dis->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/*
+	 * If we're reflinking the entire file and the source is inline
+	 * data, just copy the contents.
+	 */
+	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
+	    i_size_read(t_inode) <= len &&
+	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
+		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
+		if (ret)
+			mlog_errno(ret);
+		goto out;
+	}
+
+	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
+					 pos_out, len, &dealloc);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+out:
+	if (ocfs2_dealloc_has_cluster(&dealloc)) {
+		ocfs2_schedule_truncate_log_flush(osb, 1);
+		ocfs2_run_deallocs(osb, &dealloc);
+	}
+
+	return ret;
+}
+
+/* Lock both inodes and grab a bh pointing to each inode. */
+static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
+				     struct buffer_head **bh1,
+				     struct inode *t_inode,
+				     struct buffer_head **bh2)
+{
+	struct inode *inode1;
+	struct inode *inode2;
+	struct ocfs2_inode_info *oi1;
+	struct ocfs2_inode_info *oi2;
+	bool same_inode = (s_inode == t_inode);
+	int status;
+
+	/* First grab the VFS and rw locks. */
+	inode1 = s_inode;
+	inode2 = t_inode;
+	if (inode1->i_ino > inode2->i_ino)
+		swap(inode1, inode2);
+
+	inode_lock(inode1);
+	status = ocfs2_rw_lock(inode1, 1);
+	if (status) {
+		mlog_errno(status);
+		goto out_i1;
+	}
+	if (!same_inode) {
+		inode_lock_nested(inode2, I_MUTEX_CHILD);
+		status = ocfs2_rw_lock(inode2, 1);
+		if (status) {
+			mlog_errno(status);
+			goto out_i2;
+		}
+	}
+
+	/* Now go for the cluster locks */
+	oi1 = OCFS2_I(inode1);
+	oi2 = OCFS2_I(inode2);
+
+	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
+				(unsigned long long)oi2->ip_blkno);
+
+	if (*bh1)
+		*bh1 = NULL;
+	if (*bh2)
+		*bh2 = NULL;
+
+	/* We always want to lock the one with the lower lockid first. */
+	if (oi1->ip_blkno > oi2->ip_blkno)
+		mlog_errno(-ENOLCK);
+
+	/* lock id1 */
+	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
+	if (status < 0) {
+		if (status != -ENOENT)
+			mlog_errno(status);
+		goto out_rw2;
+	}
+
+	/* lock id2 */
+	if (!same_inode) {
+		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
+						 OI_LS_REFLINK_TARGET);
+		if (status < 0) {
+			if (status != -ENOENT)
+				mlog_errno(status);
+			goto out_cl1;
+		}
+	} else
+		*bh2 = *bh1;
+
+	trace_ocfs2_double_lock_end(
+			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
+			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
+
+	return 0;
+
+out_cl1:
+	ocfs2_inode_unlock(inode1, 1);
+	brelse(*bh1);
+	*bh1 = NULL;
+out_rw2:
+	ocfs2_rw_unlock(inode2, 1);
+out_i2:
+	inode_unlock(inode2);
+	ocfs2_rw_unlock(inode1, 1);
+out_i1:
+	inode_unlock(inode1);
+	return status;
+}
+
+/* Unlock both inodes and release buffers. */
+static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
+					struct buffer_head *s_bh,
+					struct inode *t_inode,
+					struct buffer_head *t_bh)
+{
+	ocfs2_inode_unlock(s_inode, 1);
+	ocfs2_rw_unlock(s_inode, 1);
+	inode_unlock(s_inode);
+	brelse(s_bh);
+
+	if (s_inode == t_inode)
+		return;
+
+	ocfs2_inode_unlock(t_inode, 1);
+	ocfs2_rw_unlock(t_inode, 1);
+	inode_unlock(t_inode);
+	brelse(t_bh);
+}
+
+/*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *ocfs2_reflink_get_page(struct inode *inode,
+					   loff_t offset)
+{
+	struct address_space *mapping;
+	struct page *page;
+	pgoff_t n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int ocfs2_reflink_compare_extents(struct inode *src,
+					 loff_t srcoff,
+					 struct inode *dest,
+					 loff_t destoff,
+					 loff_t len,
+					 bool *is_same)
+{
+	loff_t src_poff;
+	loff_t dest_poff;
+	void *src_addr;
+	void *dest_addr;
+	struct page *src_page;
+	struct page *dest_page;
+	loff_t cmp_len;
+	bool same;
+	int error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		if (cmp_len <= 0) {
+			mlog_errno(-EUCLEAN);
+			goto out_error;
+		}
+
+		src_page = ocfs2_reflink_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = ocfs2_reflink_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Link a range of blocks from one file to another. */
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
+	struct buffer_head *in_bh = NULL, *out_bh = NULL;
+	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
+	bool same_inode = (inode_in == inode_out);
+	bool is_same = false;
+	loff_t isize;
+	ssize_t ret;
+	loff_t blen;
+
+	if (!ocfs2_refcount_tree(osb))
+		return -EOPNOTSUPP;
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	/* Lock both files against IO */
+	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
+	if (ret)
+		return ret;
+
+	ret = -EINVAL;
+	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
+	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
+		goto out_unlock;
+
+	/* Don't touch certain kinds of inodes */
+	ret = -EPERM;
+	if (IS_IMMUTABLE(inode_out))
+		goto out_unlock;
+
+	ret = -ETXTBSY;
+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+		goto out_unlock;
+
+	/* Don't reflink dirs, pipes, sockets... */
+	ret = -EISDIR;
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		goto out_unlock;
+	ret = -EINVAL;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		goto out_unlock;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		goto out_unlock;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	if (len == 0)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		goto out_unlock;
+
+	/* Don't allow dedupe past EOF in the dest file */
+	if (is_dedupe) {
+		loff_t	disize;
+
+		disize = i_size_read(inode_out);
+		if (pos_out >= disize || pos_out + len > disize)
+			goto out_unlock;
+	}
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		goto out_unlock;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode) {
+		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
+			goto out_unlock;
+	}
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode_in);
+	if (!same_inode)
+		inode_dio_wait(inode_out);
+
+	ret = filemap_write_and_wait_range(inode_in->i_mapping,
+			pos_in, pos_in + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	ret = filemap_write_and_wait_range(inode_out->i_mapping,
+			pos_out, pos_out + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Check that the extents are the same.
+	 */
+	if (is_dedupe) {
+		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
+						    inode_out, pos_out,
+						    len, &is_same);
+		if (ret)
+			goto out_unlock;
+		if (!is_same) {
+			ret = -EBADE;
+			goto out_unlock;
+		}
+	}
+
+	/* Lock out changes to the allocation maps */
+	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
+				  SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Invalidate the page cache so that we can clear any CoW mappings
+	 * in the destination file.
+	 */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
+
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
+
+	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	/*
+	 * Empty the extent map so that we may get the right extent
+	 * record from the disk.
+	 */
+	ocfs2_extent_map_trunc(inode_in, 0);
+	ocfs2_extent_map_trunc(inode_out, 0);
+
+	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return 0;
+
+out_unlock:
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return ret;
+}
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 553edfb..c023e88 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -117,4 +117,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe);
+
 #endif /* OCFS2_REFCOUNTTREE_H */



* [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
@ 2016-11-09 22:51   ` Darrick J. Wong
From: Darrick J. Wong @ 2016-11-09 22:51 UTC (permalink / raw)
  To: mfasheh, jlbec, darrick.wong; +Cc: linux-fsdevel, ocfs2-devel

Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2.  Compared to the existing
ocfs2 reflink ioctl, we have to do things a little differently to support
the VFS semantics (we can clone subranges of a file, but we don't clone
xattrs), but the VFS ioctls are more broadly supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/file.c         |   62 ++++-
 fs/ocfs2/file.h         |    3 
 fs/ocfs2/refcounttree.c |  619 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/refcounttree.h |    7 +
 4 files changed, 688 insertions(+), 3 deletions(-)
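
For reference, the range checks that ocfs2_reflink_remap_range() applies
before remapping can be modelled in userspace roughly as follows.  This
is only a sketch under stated assumptions, not code from the patch:
remap_range_ok(), align_up(), and is_aligned() are hypothetical names,
`bs` stands in for the cluster size (1 << s_clustersize_bits), and the
dedupe-only EOF check and the isize == 0 early return are left out.

```c
#include <assert.h>	/* only needed for the usage checks noted below */
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* True if x is a multiple of a; a must be a power of two. */
static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * Hypothetical model of the range validation: returns true if the
 * request would pass the checks, false where the kernel code would
 * bail out with -EINVAL.
 */
static bool remap_range_ok(uint64_t pos_in, uint64_t pos_out, uint64_t len,
			   uint64_t isize, uint64_t bs, bool same_inode)
{
	uint64_t blen;

	/* A zero length means "to the end of the source file". */
	if (len == 0)
		len = isize - pos_in;

	/* Offsets must not wrap and the input must lie inside i_size. */
	if (pos_in + len < pos_in || pos_out + len < pos_out ||
	    pos_in + len > isize)
		return false;

	/* If the range ends at EOF, extend it to the cluster boundary. */
	if (pos_in + len == isize)
		blen = align_up(isize, bs) - pos_in;
	else
		blen = len;

	/* Both ranges must start and end on cluster boundaries. */
	if (!is_aligned(pos_in, bs) || !is_aligned(pos_in + blen, bs) ||
	    !is_aligned(pos_out, bs) || !is_aligned(pos_out + blen, bs))
		return false;

	/* No overlapping source and destination within one file. */
	if (same_inode &&
	    pos_out + blen > pos_in && pos_out < pos_in + blen)
		return false;

	return true;
}
```

With 4096-byte clusters and a 10000-byte source file,
remap_range_ok(8192, 0, 1808, 10000, 4096, false) accepts the unaligned
tail because the range ends at EOF, whereas
remap_range_ok(0, 0, 100, 10000, 4096, false) is rejected because the
range ends mid-cluster.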


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 000c234..d5a022d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
 	*done = ret;
 }
 
-static int ocfs2_remove_inode_range(struct inode *inode,
-				    struct buffer_head *di_bh, u64 byte_start,
-				    u64 byte_len)
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len)
 {
 	int ret = 0, flags = 0, done = 0, i;
 	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
@@ -2440,6 +2440,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static ssize_t ocfs2_file_copy_range(struct file *file_in,
+				     loff_t pos_in,
+				     struct file *file_out,
+				     loff_t pos_out,
+				     size_t len,
+				     unsigned int flags)
+{
+	int error;
+
+	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					  len, false);
+	if (error)
+		return error;
+	return len;
+}
+
+static int ocfs2_file_clone_range(struct file *file_in,
+				  loff_t pos_in,
+				  struct file *file_out,
+				  loff_t pos_out,
+				  u64 len)
+{
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					 len, false);
+}
+
+#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
+				       u64 loff,
+				       u64 len,
+				       struct file *dst_file,
+				       u64 dst_loff)
+{
+	int error;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > OCFS2_MAX_DEDUPE_LEN)
+		len = OCFS2_MAX_DEDUPE_LEN;
+
+	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
+					  len, true);
+	if (error)
+		return error;
+	return len;
+}
+
 const struct inode_operations ocfs2_file_iops = {
 	.setattr	= ocfs2_setattr,
 	.getattr	= ocfs2_getattr,
@@ -2479,6 +2529,9 @@ const struct file_operations ocfs2_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops = {
@@ -2524,6 +2577,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops_no_plocks = {
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index e8c62f2..897fd9a 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 
 int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 				   size_t count);
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len);
 #endif /* OCFS2_FILE_H */
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index d92b6c6..3e2198c 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -34,6 +34,7 @@
 #include "xattr.h"
 #include "namei.h"
 #include "ocfs2_trace.h"
+#include "file.h"
 
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -4447,3 +4448,621 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 
 	return error;
 }
+
+/* Update destination inode size, if necessary. */
+static int ocfs2_reflink_update_dest(struct inode *dest,
+				     struct buffer_head *d_bh,
+				     loff_t newlen)
+{
+	handle_t *handle;
+	struct ocfs2_dinode *di = (struct ocfs2_dinode *)d_bh->b_data;
+	int ret;
+
+	if (newlen <= i_size_read(dest))
+		return 0;
+
+	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
+				   OCFS2_INODE_UPDATE_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		mlog_errno(ret);
+		return ret;
+	}
+
+	ret = ocfs2_journal_access_di(handle, INODE_CACHE(dest), d_bh,
+				      OCFS2_JOURNAL_ACCESS_WRITE);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_commit;
+	}
+
+	spin_lock(&OCFS2_I(dest)->ip_lock);
+	if (newlen > i_size_read(dest)) {
+		i_size_write(dest, newlen);
+		di->i_size = cpu_to_le64(newlen);
+	}
+	spin_unlock(&OCFS2_I(dest)->ip_lock);
+
+	ocfs2_journal_dirty(handle, d_bh);
+
+out_commit:
+	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
+	return ret;
+}
+
+/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
+static int ocfs2_reflink_remap_extent(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len,
+				      struct ocfs2_cached_dealloc_ctxt *dealloc)
+{
+	struct ocfs2_extent_tree s_et;
+	struct ocfs2_extent_tree t_et;
+	struct ocfs2_dinode *dis;
+	struct buffer_head *ref_root_bh = NULL;
+	struct ocfs2_refcount_tree *ref_tree;
+	struct ocfs2_super *osb;
+	loff_t pstart, plen;
+	u32 p_cluster, num_clusters, slast, spos, tpos;
+	unsigned int ext_flags;
+	int ret = 0;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
+	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
+
+	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
+	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
+	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
+
+	while (spos < slast) {
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
+
+		/* Look up the extent. */
+		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
+					 &num_clusters, &ext_flags);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		num_clusters = min_t(u32, num_clusters, slast - spos);
+
+		/* Punch out the dest range. */
+		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
+		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
+		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (p_cluster == 0)
+			goto next_loop;
+
+		/* Lock the refcount btree... */
+		ret = ocfs2_lock_refcount_tree(osb,
+					       le64_to_cpu(dis->i_refcount_loc),
+					       1, &ref_tree, &ref_root_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		/* Mark s_inode's extent as refcounted. */
+		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
+			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
+						      &ref_tree->rf_ci,
+						      ref_root_bh, spos,
+						      p_cluster, num_clusters,
+						      dealloc, NULL);
+			if (ret) {
+				mlog_errno(ret);
+				goto out_unlock_refcount;
+			}
+		}
+
+		/* Map in the new extent. */
+		ext_flags |= OCFS2_EXT_REFCOUNTED;
+		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
+						  &ref_tree->rf_ci,
+						  ref_root_bh,
+						  tpos, p_cluster,
+						  num_clusters,
+						  ext_flags,
+						  dealloc);
+		if (ret) {
+			mlog_errno(ret);
+			goto out_unlock_refcount;
+		}
+
+		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+		brelse(ref_root_bh);
+next_loop:
+		spos += num_clusters;
+		tpos += num_clusters;
+	}
+
+out:
+	return ret;
+out_unlock_refcount:
+	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+	brelse(ref_root_bh);
+	return ret;
+}
+
+/* Set up refcount tree and remap s_inode to t_inode. */
+static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len)
+{
+	struct ocfs2_cached_dealloc_ctxt dealloc;
+	struct ocfs2_super *osb;
+	struct ocfs2_dinode *dis;
+	struct ocfs2_dinode *dit;
+	int ret;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	dit = (struct ocfs2_dinode *)t_bh->b_data;
+	ocfs2_init_dealloc_ctxt(&dealloc);
+
+	/*
+	 * If both inodes belong to two different refcount groups then
+	 * forget it because we don't know how (or want) to go merging
+	 * refcount trees.
+	 */
+	ret = -EOPNOTSUPP;
+	if (ocfs2_is_refcount_inode(s_inode) &&
+	    ocfs2_is_refcount_inode(t_inode) &&
+	    le64_to_cpu(dis->i_refcount_loc) !=
+	    le64_to_cpu(dit->i_refcount_loc))
+		goto out;
+
+	/* Neither inode has a refcount tree.  Add one to s_inode. */
+	if (!ocfs2_is_refcount_inode(s_inode) &&
+	    !ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Ensure that both inodes end up with the same refcount tree. */
+	if (!ocfs2_is_refcount_inode(s_inode)) {
+		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
+					      le64_to_cpu(dit->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+	if (!ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
+					      le64_to_cpu(dis->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/*
+	 * If we're reflinking the entire file and the source is inline
+	 * data, just copy the contents.
+	 */
+	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
+	    i_size_read(t_inode) <= len &&
+	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
+		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
+		if (ret)
+			mlog_errno(ret);
+		goto out;
+	}
+
+	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
+					 pos_out, len, &dealloc);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+out:
+	if (ocfs2_dealloc_has_cluster(&dealloc)) {
+		ocfs2_schedule_truncate_log_flush(osb, 1);
+		ocfs2_run_deallocs(osb, &dealloc);
+	}
+
+	return ret;
+}
+
+/* Lock both inodes and grab a bh pointing to each inode. */
+static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
+				     struct buffer_head **bh1,
+				     struct inode *t_inode,
+				     struct buffer_head **bh2)
+{
+	struct inode *inode1;
+	struct inode *inode2;
+	struct ocfs2_inode_info *oi1;
+	struct ocfs2_inode_info *oi2;
+	bool same_inode = (s_inode == t_inode);
+	int status;
+
+	/* First grab the VFS and rw locks. */
+	inode1 = s_inode;
+	inode2 = t_inode;
+	if (inode1->i_ino > inode2->i_ino)
+		swap(inode1, inode2);
+
+	inode_lock(inode1);
+	status = ocfs2_rw_lock(inode1, 1);
+	if (status) {
+		mlog_errno(status);
+		goto out_i1;
+	}
+	if (!same_inode) {
+		inode_lock_nested(inode2, I_MUTEX_CHILD);
+		status = ocfs2_rw_lock(inode2, 1);
+		if (status) {
+			mlog_errno(status);
+			goto out_i2;
+		}
+	}
+
+	/* Now go for the cluster locks */
+	oi1 = OCFS2_I(inode1);
+	oi2 = OCFS2_I(inode2);
+
+	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
+				(unsigned long long)oi2->ip_blkno);
+
+	if (*bh1)
+		*bh1 = NULL;
+	if (*bh2)
+		*bh2 = NULL;
+
+	/* We always want to lock the one with the lower lockid first. */
+	if (oi1->ip_blkno > oi2->ip_blkno)
+		mlog_errno(-ENOLCK);
+
+	/* lock id1 */
+	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
+	if (status < 0) {
+		if (status != -ENOENT)
+			mlog_errno(status);
+		goto out_rw2;
+	}
+
+	/* lock id2 */
+	if (!same_inode) {
+		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
+						 OI_LS_REFLINK_TARGET);
+		if (status < 0) {
+			if (status != -ENOENT)
+				mlog_errno(status);
+			goto out_cl1;
+		}
+	} else
+		*bh2 = *bh1;
+
+	trace_ocfs2_double_lock_end(
+			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
+			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
+
+	return 0;
+
+out_cl1:
+	ocfs2_inode_unlock(inode1, 1);
+	brelse(*bh1);
+	*bh1 = NULL;
+out_rw2:
+	ocfs2_rw_unlock(inode2, 1);
+out_i2:
+	inode_unlock(inode2);
+	ocfs2_rw_unlock(inode1, 1);
+out_i1:
+	inode_unlock(inode1);
+	return status;
+}
+
+/* Unlock both inodes and release buffers. */
+static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
+					struct buffer_head *s_bh,
+					struct inode *t_inode,
+					struct buffer_head *t_bh)
+{
+	ocfs2_inode_unlock(s_inode, 1);
+	ocfs2_rw_unlock(s_inode, 1);
+	inode_unlock(s_inode);
+	brelse(s_bh);
+
+	if (s_inode == t_inode)
+		return;
+
+	ocfs2_inode_unlock(t_inode, 1);
+	ocfs2_rw_unlock(t_inode, 1);
+	inode_unlock(t_inode);
+	brelse(t_bh);
+}
+
+/*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *ocfs2_reflink_get_page(struct inode *inode,
+					   loff_t offset)
+{
+	struct address_space *mapping;
+	struct page *page;
+	pgoff_t n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int ocfs2_reflink_compare_extents(struct inode *src,
+					 loff_t srcoff,
+					 struct inode *dest,
+					 loff_t destoff,
+					 loff_t len,
+					 bool *is_same)
+{
+	loff_t src_poff;
+	loff_t dest_poff;
+	void *src_addr;
+	void *dest_addr;
+	struct page *src_page;
+	struct page *dest_page;
+	loff_t cmp_len;
+	bool same;
+	int error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		if (cmp_len <= 0) {
+			mlog_errno(-EUCLEAN);
+			goto out_error;
+		}
+
+		src_page = ocfs2_reflink_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = ocfs2_reflink_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Link a range of blocks from one file to another. */
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
+	struct buffer_head *in_bh = NULL, *out_bh = NULL;
+	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
+	bool same_inode = (inode_in == inode_out);
+	bool is_same = false;
+	loff_t isize;
+	ssize_t ret;
+	loff_t blen;
+
+	if (!ocfs2_refcount_tree(osb))
+		return -EOPNOTSUPP;
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	/* Lock both files against IO */
+	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
+	if (ret)
+		return ret;
+
+	ret = -EINVAL;
+	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
+	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
+		goto out_unlock;
+
+	/* Don't touch certain kinds of inodes */
+	ret = -EPERM;
+	if (IS_IMMUTABLE(inode_out))
+		goto out_unlock;
+
+	ret = -ETXTBSY;
+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+		goto out_unlock;
+
+	/* Don't reflink dirs, pipes, sockets... */
+	ret = -EISDIR;
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		goto out_unlock;
+	ret = -EINVAL;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		goto out_unlock;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		goto out_unlock;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	if (len == 0)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		goto out_unlock;
+
+	/* Don't allow dedupe past EOF in the dest file */
+	if (is_dedupe) {
+		loff_t	disize;
+
+		disize = i_size_read(inode_out);
+		if (pos_out >= disize || pos_out + len > disize)
+			goto out_unlock;
+	}
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		goto out_unlock;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode) {
+		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
+			goto out_unlock;
+	}
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode_in);
+	if (!same_inode)
+		inode_dio_wait(inode_out);
+
+	ret = filemap_write_and_wait_range(inode_in->i_mapping,
+			pos_in, pos_in + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	ret = filemap_write_and_wait_range(inode_out->i_mapping,
+			pos_out, pos_out + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Check that the extents are the same.
+	 */
+	if (is_dedupe) {
+		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
+						    inode_out, pos_out,
+						    len, &is_same);
+		if (ret)
+			goto out_unlock;
+		if (!is_same) {
+			ret = -EBADE;
+			goto out_unlock;
+		}
+	}
+
+	/* Lock out changes to the allocation maps */
+	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
+				  SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Invalidate the page cache so that we can clear any CoW mappings
+	 * in the destination file.
+	 */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
+
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
+
+	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	/*
+	 * Empty the extent map so that we may get the right extent
+	 * record from the disk.
+	 */
+	ocfs2_extent_map_trunc(inode_in, 0);
+	ocfs2_extent_map_trunc(inode_out, 0);
+
+	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return 0;
+
+out_unlock:
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return ret;
+}
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 553edfb..c023e88 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -117,4 +117,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe);
+
 #endif /* OCFS2_REFCOUNTTREE_H */

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 7/6] xfstests: fix some minor problems testing ocfs2
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-09 23:00   ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-09 23:00 UTC (permalink / raw)
  To: mfasheh, jlbec, eguan; +Cc: linux-fsdevel, ocfs2-devel, fstests

There are a few things about ocfs2 tools that need special-casing in
xfstests, so fix them so that we can start testing ocfs2.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 common/quota |    2 +-
 common/rc    |   10 ++++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/common/quota b/common/quota
index 678bc43..d9bb8d9 100644
--- a/common/quota
+++ b/common/quota
@@ -34,7 +34,7 @@ _require_quota()
 	    _notrun "Installed kernel does not support quotas"
 	fi
 	;;
-    gfs2)
+    gfs2|ocfs2)
 	;;
     xfs)
 	if [ ! -f /proc/fs/xfs/xqmstat ]; then
diff --git a/common/rc b/common/rc
index 8e078da..c75b614 100644
--- a/common/rc
+++ b/common/rc
@@ -978,7 +978,7 @@ _scratch_mkfs_sized()
     xfs)
 	def_blksz=`echo $MKFS_OPTIONS|sed -rn 's/.*-b ?size= ?+([0-9]+).*/\1/p'`
 	;;
-    ext2|ext3|ext4|ext4dev|udf|btrfs|reiser4)
+    ext2|ext3|ext4|ext4dev|udf|btrfs|reiser4|ocfs2)
 	def_blksz=`echo $MKFS_OPTIONS| sed -rn 's/.*-b ?+([0-9]+).*/\1/p'`
 	;;
     esac
@@ -1015,6 +1015,9 @@ _scratch_mkfs_sized()
     ext2|ext3|ext4|ext4dev)
 	${MKFS_PROG}.$FSTYP -F $MKFS_OPTIONS -b $blocksize $SCRATCH_DEV $blocks
 	;;
+    ocfs2)
+	yes | ${MKFS_PROG}.$FSTYP -F $MKFS_OPTIONS -b $blocksize $SCRATCH_DEV $blocks
+	;;
     udf)
 	$MKFS_UDF_PROG $MKFS_OPTIONS -b $blocksize $SCRATCH_DEV $blocks
 	;;
@@ -1087,9 +1090,12 @@ _scratch_mkfs_blocksized()
     xfs)
 	_scratch_mkfs_xfs $MKFS_OPTIONS -b size=$blocksize
 	;;
-    ext2|ext3|ext4|ocfs2)
+    ext2|ext3|ext4)
 	${MKFS_PROG}.$FSTYP -F $MKFS_OPTIONS -b $blocksize $SCRATCH_DEV
 	;;
+    ocfs2)
+	yes | ${MKFS_PROG}.$FSTYP -F $MKFS_OPTIONS -b $blocksize $SCRATCH_DEV
+	;;
     *)
 	_notrun "Filesystem $FSTYP not supported in _scratch_mkfs_blocksized"
 	;;

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/6] ocfs2: convert inode refcount test to a helper
  2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
@ 2016-11-10  2:14     ` Eric Ren
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Ren @ 2016-11-10  2:14 UTC (permalink / raw)
  To: Darrick J. Wong, mfasheh, jlbec; +Cc: linux-fsdevel, ocfs2-devel

On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> Replace the open-coded inode refcount flag test with a helper function
> to reduce the potential for bugs.
Thanks for this series;-) Some comments inline below:
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/ocfs2/refcounttree.c |   28 +++++++++++++++-------------
>   fs/ocfs2/refcounttree.h |    2 ++
>   2 files changed, 17 insertions(+), 13 deletions(-)
>
>
> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> index 1923851..59be8f4 100644
> --- a/fs/ocfs2/refcounttree.c
> +++ b/fs/ocfs2/refcounttree.c
> @@ -48,6 +48,12 @@
>   #include <linux/mount.h>
>   #include <linux/posix_acl.h>
>   
> +/* Does this inode have the reflink flag set? */
> +bool ocfs2_is_refcount_inode(struct inode *inode)
Should it be an inline function?

After applying this patch, it looks like there are still some places that
have not been converted to use this function:
---
fs/ocfs2 # grep -rn "OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL"
xattr.c:2580:    if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
xattr.c:3611:    if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL &&
file.c:1722:    if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
file.c:2039:        !(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) ||
refcounttree.c:55:    return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);

Eric

> +{
> +	return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
> +}
> +
>   struct ocfs2_cow_context {
>   	struct inode *inode;
>   	u32 cow_start;
> @@ -410,7 +416,7 @@ static int ocfs2_get_refcount_block(struct inode *inode, u64 *ref_blkno)
>   		goto out;
>   	}
>   
> -	BUG_ON(!(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> +	BUG_ON(!ocfs2_is_refcount_inode(inode));
>   
>   	di = (struct ocfs2_dinode *)di_bh->b_data;
>   	*ref_blkno = le64_to_cpu(di->i_refcount_loc);
> @@ -570,7 +576,7 @@ static int ocfs2_create_refcount_tree(struct inode *inode,
>   	u32 num_got;
>   	u64 suballoc_loc, first_blkno;
>   
> -	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
> +	BUG_ON(ocfs2_is_refcount_inode(inode));
>   
>   	trace_ocfs2_create_refcount_tree(
>   		(unsigned long long)OCFS2_I(inode)->ip_blkno);
> @@ -708,7 +714,7 @@ static int ocfs2_set_refcount_tree(struct inode *inode,
>   	struct ocfs2_refcount_block *rb;
>   	struct ocfs2_refcount_tree *ref_tree;
>   
> -	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
> +	BUG_ON(ocfs2_is_refcount_inode(inode));
>   
>   	ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
>   				       &ref_tree, &ref_root_bh);
> @@ -775,7 +781,7 @@ int ocfs2_remove_refcount_tree(struct inode *inode, struct buffer_head *di_bh)
>   	u64 blk = 0, bg_blkno = 0, ref_blkno = le64_to_cpu(di->i_refcount_loc);
>   	u16 bit = 0;
>   
> -	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL))
> +	if (!ocfs2_is_refcount_inode(inode))
>   		return 0;
>   
>   	BUG_ON(!ref_blkno);
> @@ -2299,11 +2305,10 @@ int ocfs2_decrease_refcount(struct inode *inode,
>   {
>   	int ret;
>   	u64 ref_blkno;
> -	struct ocfs2_inode_info *oi = OCFS2_I(inode);
>   	struct buffer_head *ref_root_bh = NULL;
>   	struct ocfs2_refcount_tree *tree;
>   
> -	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> +	BUG_ON(!ocfs2_is_refcount_inode(inode));
>   
>   	ret = ocfs2_get_refcount_block(inode, &ref_blkno);
>   	if (ret) {
> @@ -2533,7 +2538,6 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
>   					  int *ref_blocks)
>   {
>   	int ret;
> -	struct ocfs2_inode_info *oi = OCFS2_I(inode);
>   	struct buffer_head *ref_root_bh = NULL;
>   	struct ocfs2_refcount_tree *tree;
>   	u64 start_cpos = ocfs2_blocks_to_clusters(inode->i_sb, phys_blkno);
> @@ -2544,7 +2548,7 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
>   		goto out;
>   	}
>   
> -	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> +	BUG_ON(!ocfs2_is_refcount_inode(inode));
>   
>   	ret = ocfs2_get_refcount_tree(OCFS2_SB(inode->i_sb),
>   				      refcount_loc, &tree);
> @@ -3412,14 +3416,13 @@ static int ocfs2_refcount_cow_hunk(struct inode *inode,
>   {
>   	int ret;
>   	u32 cow_start = 0, cow_len = 0;
> -	struct ocfs2_inode_info *oi = OCFS2_I(inode);
>   	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>   	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
>   	struct buffer_head *ref_root_bh = NULL;
>   	struct ocfs2_refcount_tree *ref_tree;
>   	struct ocfs2_cow_context *context = NULL;
>   
> -	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> +	BUG_ON(!ocfs2_is_refcount_inode(inode));
>   
>   	ret = ocfs2_refcount_cal_cow_clusters(inode, &di->id2.i_list,
>   					      cpos, write_len, max_cpos,
> @@ -3629,11 +3632,10 @@ int ocfs2_refcount_cow_xattr(struct inode *inode,
>   {
>   	int ret;
>   	struct ocfs2_xattr_value_root *xv = vb->vb_xv;
> -	struct ocfs2_inode_info *oi = OCFS2_I(inode);
>   	struct ocfs2_cow_context *context = NULL;
>   	u32 cow_start, cow_len;
>   
> -	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> +	BUG_ON(!ocfs2_is_refcount_inode(inode));
>   
>   	ret = ocfs2_refcount_cal_cow_clusters(inode, &xv->xr_list,
>   					      cpos, write_len, UINT_MAX,
> @@ -3807,7 +3809,7 @@ static int ocfs2_attach_refcount_tree(struct inode *inode,
>   
>   	ocfs2_init_dealloc_ctxt(&dealloc);
>   
> -	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL)) {
> +	if (!ocfs2_is_refcount_inode(inode)) {
>   		ret = ocfs2_create_refcount_tree(inode, di_bh);
>   		if (ret) {
>   			mlog_errno(ret);
> diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
> index 6422bbc..553edfb 100644
> --- a/fs/ocfs2/refcounttree.h
> +++ b/fs/ocfs2/refcounttree.h
> @@ -17,6 +17,8 @@
>   #ifndef OCFS2_REFCOUNTTREE_H
>   #define OCFS2_REFCOUNTTREE_H
>   
> +bool ocfs2_is_refcount_inode(struct inode *inode);
> +
>   struct ocfs2_refcount_tree {
>   	struct rb_node rf_node;
>   	u64 rf_blkno;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag
  2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
@ 2016-11-10  9:20     ` Darwin
  -1 siblings, 0 replies; 42+ messages in thread
From: Darwin @ 2016-11-10  9:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

Hello,

On Thu, Nov 10, 2016 at 6:51 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> When we're adding the refcount flag to an extent, we have to budget
> enough space to handle a full extent btree split in addition to

May I ask some questions (possibly stupid):
1) Why and when would the extent btree split? From my understanding, if I do

$cp --reflink a b

and "a" is not inline-data, a refcount tree will be allocated, and every
extent record of "a" will refer to its respective refcount record and be
marked as refcounted; the same goes for "b".
So I think splitting only happens when writing to them, CMIIW;-)

2) what do you mean by "*full* extent btree"?

> whatever modifications have to be made to the refcount btree.  We
> don't currently do this, with the result that generic/186 crashes
> when we need an extent split but not a refcount split because meta_ac
> never gets allocated.

3) In what situation will this happen: "we need an extent split but
not a refcount split"?
Could you please explain more by example?

Thanks,
Darwin

>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/ocfs2/refcounttree.c |    3 +++
>  1 file changed, 3 insertions(+)
>
>
> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> index 59be8f4..d92b6c6 100644
> --- a/fs/ocfs2/refcounttree.c
> +++ b/fs/ocfs2/refcounttree.c
> @@ -3698,6 +3698,9 @@ int ocfs2_add_refcount_flag(struct inode *inode,
>         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>         struct ocfs2_alloc_context *meta_ac = NULL;
>
> +       /* We need to be able to handle at least an extent tree split. */
> +       ref_blocks = ocfs2_extend_meta_needed(data_et->et_root_el);
> +
>         ret = ocfs2_calc_refcount_meta_credits(inode->i_sb,
>                                                ref_ci, ref_root_bh,
>                                                p_cluster, num_clusters,
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel



-- 
Thanks,
Darwin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag
  2016-11-10  9:20     ` Darwin
@ 2016-11-10 17:11       ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-10 17:11 UTC (permalink / raw)
  To: Darwin; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Thu, Nov 10, 2016 at 05:20:27PM +0800, Darwin wrote:
> Hello,
> 
> On Thu, Nov 10, 2016 at 6:51 AM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > When we're adding the refcount flag to an extent, we have to budget
> > enough space to handle a full extent btree split in addition to
> 
> May I ask some questions (possibly stupid):
> 1) why and when would extend btree split? From my understanding, if I do
> 
> $cp --reflink a b
> 
> a is not inline-data. refcount tree will be allocated, every extent
> record of "a" will refer to
> refcount record respectively and be marked with refcounted, operations
> like this also for "b".
> So, I think splitting only happens when writing on them, CMIIW;-)

The VFS reflink interface (FICLONERANGE) and dedupe interface (FIDEDUPERANGE)
allow callers to specify the range of bytes on which to operate, which means
that we can share arbitrary parts of two files.  For example, let's say you
have one regular extent in a file's block map:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR (regular extent)

Now ask reflink to share the middle of that extent with some other file.
ocfs2's extent mapping record can store a flag indicating that the
extent could be shared, which means that the one record now splits into
three:

RRRRRRRRRRRssssssRRRRRRRRRRRRR (regular, shared, regular)

This scenario happens if you run duperemove against an ocfs2 filesystem
to deduplicate file data, since dedupe wants the filesystem to be able
to share arbitrary blocks of files.

You're correct that cp --reflink only ever deals with entire extents
because it calls FICLONERANGE with both file offsets zero and the length
set to the length of the source file.  The important thing to remember
is that we are not limited to sharing entire files, even if the
userspace utilities don't take advantage of it.

(FWIW the duperemove program does take advantage of it.)
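
To make the one-record-into-three split concrete, here is a minimal sketch
in plain C.  It is not ocfs2 code: the struct, its fields, and the function
name are hypothetical, and the record layout is simplified to a logical
start, a length, and a "shared" flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record; not the ocfs2 on-disk format. */
struct sketch_extent {
	uint64_t start;		/* first logical block covered */
	uint64_t len;		/* number of blocks covered */
	int shared;		/* nonzero if the blocks may be shared */
};

/*
 * Mark [share_start, share_start + share_len) as shared inside *ext.
 * Writes the resulting records to out[] (room for three) and returns
 * how many were written, or 0 if the range is empty, wraps around, or
 * does not lie entirely inside the extent.
 */
static size_t sketch_split_extent(const struct sketch_extent *ext,
				  uint64_t share_start, uint64_t share_len,
				  struct sketch_extent out[3])
{
	uint64_t ext_end = ext->start + ext->len;
	uint64_t share_end = share_start + share_len;
	size_t n = 0;

	if (share_len == 0 || share_end < share_start)
		return 0;
	if (share_start < ext->start || share_end > ext_end)
		return 0;

	/* Unshared head, if the shared range does not start the extent. */
	if (share_start > ext->start) {
		out[n].start = ext->start;
		out[n].len = share_start - ext->start;
		out[n].shared = ext->shared;
		n++;
	}

	/* The shared middle piece. */
	out[n].start = share_start;
	out[n].len = share_len;
	out[n].shared = 1;
	n++;

	/* Unshared tail, if the shared range stops short of the end. */
	if (share_end < ext_end) {
		out[n].start = share_end;
		out[n].len = ext_end - share_end;
		out[n].shared = ext->shared;
		n++;
	}

	return n;
}
```

Sharing 6 blocks in the middle of a 30-block extent yields three records
(regular, shared, regular), while sharing the whole extent yields just one,
which is why a whole-file clone never needs to split a record.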

> 2) what do you mean by "*full* extent btree"?

(Slight correction to that -- I should have said "full extent tree split".)

Record insertion operations on a tree structure all follow the same
basic strategy -- search from the root towards the records in the leaf
blocks until we find the place where the record would be, memmove() all
the records following that spot up by one index, copy the record data
into the newly vacated slot, and (if necessary) walk back up the tree to
update the interior node pointers.

If the desired leaf is already full, however, there is a problem -- we
have to split the leaf into two half-full leaf blocks before we can
insert the record.  We must also add a pointer to the new leaf into the
next level up in the tree.  If that interior node is also full, we split
the interior node prior to adding the new leaf block pointer.  We must
then add a pointer to the new interior node into the next level up in
the tree, and so on until we reach the root.

In short, if we want to add a record to a tree whose blocks are
completely full, we end up splitting blocks all the way up the tree.
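
The cascade can be sketched as a small counting function.  This is a generic
tree model, not the ocfs2 implementation or its exact block accounting; the
function name and the idea of passing per-level fill counts are assumptions
made purely for illustration.

```c
#include <stddef.h>

/*
 * Count the new blocks needed to insert one record, given how many
 * records each block on the root-to-leaf path currently holds.
 * used[0] is the leaf and used[levels - 1] is the root; capacity is
 * the number of records a block can hold.  Assumes levels >= 1.
 */
static size_t sketch_blocks_for_insert(const unsigned int *used,
				       size_t levels, unsigned int capacity)
{
	size_t new_blocks = 0;
	size_t i;

	for (i = 0; i < levels; i++) {
		if (used[i] < capacity)
			return new_blocks;	/* room here; the split stops */
		new_blocks++;			/* this block splits in two */
	}

	/* Every level split, including the root: add a new root above it. */
	return new_blocks + 1;
}
```

With three completely full levels the function reports four new blocks (one
per level plus a new root); if the leaf has room it reports zero.  A caller
that cannot cheaply inspect the path has to budget for the worst case.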

> > whatever modifications have to be made to the refcount btree.  We
> > don't currently do this, with the result that generic/186 crashes
> > when we need an extent split but not a refcount split because meta_ac
> > never gets allocated.
> 
> 3) in what situation, will this happen? - "we need an extent split but
> not a refcount split".
> Could you please explain more by example?

An extreme example would be a program like this:

- write to block zero
- for i in 1 to 524287,
  - reflink block zero to block $i

When this program terminates, the refcount tree will contain a single
refcount record ($phys_blk, len=1, refcount=524288).  The extent map for
this file, however, will have 524,288 extent records:

($phys_blk, len=1, offset=0, flags=shared)
($phys_blk, len=1, offset=1, flags=shared)
...
($phys_blk, len=1, offset=524287, flags=shared)

There are 524288 records.  A 4k leaf can fit 252 records, so there will
be 2081 leaf blocks.  A 4k node also can fit 252 records, so there will
be 9 node blocks pointing to leaves, and one root block to point to the
first level nodes.  Clearly, the extent tree has split many times across
all the reflink operations.  However, the refcount tree never splits
because there's only one record.
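
The block counts above are repeated round-up divisions, which can be checked
with a short sketch.  The helper names are hypothetical, and the figure of
252 records per 4k block is taken from the text above rather than derived
from the on-disk format.

```c
#include <stdint.h>

/* Round-up division: blocks needed to hold `items` entries.
 * Assumes per_block > 0. */
static uint64_t sketch_div_round_up(uint64_t items, uint64_t per_block)
{
	return (items + per_block - 1) / per_block;
}

/*
 * Fill counts[] with the number of blocks at each level, leaves first,
 * stopping once a level fits in a single block (the root) or max_levels
 * entries have been written.  Returns the number of levels filled in.
 */
static unsigned int sketch_tree_levels(uint64_t records, uint64_t per_block,
				       uint64_t *counts,
				       unsigned int max_levels)
{
	unsigned int level = 0;
	uint64_t n = records;

	while (level < max_levels) {
		n = sketch_div_round_up(n, per_block);
		counts[level++] = n;
		if (n <= 1)
			break;
	}
	return level;
}
```

For 524288 records at 252 per block this gives 2081 leaf blocks, 9 interior
blocks, and 1 root, matching the figures above.  The same calculation for a
single refcount record gives one block, so that tree has nothing to split.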

--D

> 
> Thanks,
> Darwin
> 
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/ocfs2/refcounttree.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> >
> > diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> > index 59be8f4..d92b6c6 100644
> > --- a/fs/ocfs2/refcounttree.c
> > +++ b/fs/ocfs2/refcounttree.c
> > @@ -3698,6 +3698,9 @@ int ocfs2_add_refcount_flag(struct inode *inode,
> >         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> >         struct ocfs2_alloc_context *meta_ac = NULL;
> >
> > +       /* We need to be able to handle at least an extent tree split. */
> > +       ref_blocks = ocfs2_extend_meta_needed(data_et->et_root_el);
> > +
> >         ret = ocfs2_calc_refcount_meta_credits(inode->i_sb,
> >                                                ref_ci, ref_root_bh,
> >                                                p_cluster, num_clusters,
> >
> >
> > _______________________________________________
> > Ocfs2-devel mailing list
> > Ocfs2-devel@oss.oracle.com
> > https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> 
> 
> -- 
> Thanks,
> Darwin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Ocfs2-devel] [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag
@ 2016-11-10 17:11       ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-10 17:11 UTC (permalink / raw)
  To: Darwin; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Thu, Nov 10, 2016 at 05:20:27PM +0800, Darwin wrote:
> Hello,
> 
> On Thu, Nov 10, 2016 at 6:51 AM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > When we're adding the refcount flag to an extent, we have to budget
> > enough space to handle a full extent btree split in addition to
> 
> May I ask some questions (possibly stupid):
> 1) why and when would extend btree split? From my understanding, if I do
> 
> $cp --reflink a b
> 
> a is not inline-data. refcount tree will be allocated, every extent
> record of "a" will refer to
> refcount record respectively and be marked with refcounted, operations
> like this also for "b".
> So, I think splitting only happens when writing on them, CMIIW;-)

The VFS reflink interface (FICLONERANGE) and dedupe interface (FIDEDUPERANGE)
allow callers to specify the range of bytes on which to operate, which means
that we can share arbitrary parts of two files.  For example, let's say you
have one regular extent in a file's block map:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR (regular extent)

Now ask reflink to share the middle of that extent with some other file.
ocfs2's extent mapping record can store a flag indicating that the
extent could be shared, which means that the one record now splits into
three:

RRRRRRRRRRRssssssRRRRRRRRRRRRR (regular, shared, regular)

This scenario happens if you run duperemove against an ocfs2 filesystem
to deduplicate file data, since dedupe wants the filesystem to be able
to share arbitrary blocks of files.
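
A rough user-space sketch of that one-into-three split, for illustration
only.  This is not ocfs2 code: struct toy_extent, TOY_SHARED, and
toy_share_range() are made-up names, and the model ignores physical
block numbers entirely.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SHARED 0x1u	/* hypothetical flag standing in for "refcounted" */

/* Hypothetical extent record: logical offset, length, flags. */
struct toy_extent {
	uint32_t offset;
	uint32_t len;
	uint32_t flags;
};

/*
 * Mark [start, start + len) inside *ext as shared.  Writes the resulting
 * records to out[] (room for 3) and returns how many were produced.
 */
static int toy_share_range(const struct toy_extent *ext,
			   uint32_t start, uint32_t len,
			   struct toy_extent out[3])
{
	uint32_t end = start + len;
	uint32_t ext_end = ext->offset + ext->len;
	int n = 0;

	/* The range must be non-empty and lie within the extent. */
	assert(len > 0 && start >= ext->offset && end <= ext_end);

	if (start > ext->offset)	/* left remainder stays regular */
		out[n++] = (struct toy_extent){ ext->offset,
						start - ext->offset,
						ext->flags };

	out[n++] = (struct toy_extent){ start, len, ext->flags | TOY_SHARED };

	if (end < ext_end)		/* right remainder stays regular */
		out[n++] = (struct toy_extent){ end, ext_end - end,
						ext->flags };

	return n;
}
```

Sharing the whole extent produces one record; sharing a strictly
interior range produces three, and that is the case that adds records to
the extent tree.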

You're correct that cp --reflink only ever deals with entire extents
because it calls FICLONERANGE with both file offsets zero and the length
set to the length of the source file.  The important thing to remember
is that we are not limited to sharing entire files, even if the
userspace utilities don't take advantage of it.

(FWIW the duperemove program does take advantage of it.)
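
For reference, a minimal sketch of how a userspace caller might share
only part of a file.  FICLONERANGE and struct file_clone_range come from
<linux/fs.h>; the wrapper name share_range() is made up for this
example.  Disk filesystems generally expect the offsets and length to be
aligned to the filesystem block size, so treat the arguments as an
assumption about the caller.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

/*
 * Ask the filesystem to share len bytes of src_fd, starting at src_off,
 * into dest_fd at dest_off.  Returns 0 on success, or -1 with errno set.
 */
static int share_range(int src_fd, int dest_fd,
		       uint64_t src_off, uint64_t len, uint64_t dest_off)
{
	struct file_clone_range fcr = {
		.src_fd = src_fd,
		.src_offset = src_off,
		.src_length = len,
		.dest_offset = dest_off,
	};

	/* The ioctl is issued on the destination file descriptor. */
	return ioctl(dest_fd, FICLONERANGE, &fcr);
}
```

A call that covers only the middle of an extent is the kind of request
that forces the filesystem to split one mapping record into three.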

> 2) what do you mean by "*full* extent btree"?

(Slight correction to that -- I should have said "full extent tree split".)

Record insertion operations on a tree structure all follow the same
basic strategy -- search from the root towards the records in the leaf
blocks until we find the place where the record would be, memmove() all
the records following that spot up by one index, copy the record data
into the newly vacated slot, and (if necessary) walk back up the tree to
update the interior node pointers.

If the desired leaf is already full, however, there is a problem -- we
have to split the leaf into two half-full leaf blocks before we can
insert the record.  We must also add a pointer to the new leaf into the
next level up in the tree.  If that interior node is also full, we split
the interior node prior to adding the new leaf block pointer.  We must
then add a pointer to the new interior node into the next level up in
the tree, and so on until we reach the root.

In short, if we want to add a record to a tree whose blocks are
completely full, we end up splitting blocks all the way up the tree.
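
The cascade can be modelled with a few lines of C.  This is a generic
tree sketch under stated assumptions, not the ocfs2 allocator:
toy_blocks_for_insert() is a made-up name, every block has the same
capacity, and a root split is assumed to cost one sibling plus one new
root.

```c
#include <assert.h>

/*
 * path_used[0] is the record count of the target leaf, path_used[1] its
 * parent, and so on up to path_used[depth - 1], the root.  Each block
 * holds at most `capacity` records.  Returns how many new blocks an
 * insert into the leaf would have to allocate.
 */
static int toy_blocks_for_insert(const int *path_used, int depth,
				 int capacity)
{
	int new_blocks = 0;
	int level;

	assert(depth > 0 && capacity > 0);

	for (level = 0; level < depth; level++) {
		if (path_used[level] < capacity)
			return new_blocks;	/* room here; the split stops */
		new_blocks++;			/* full block splits in two */
	}

	/* Even the root split, so a new root is needed above it. */
	return new_blocks + 1;
}
```

The worst case grows with the depth of the tree, which is why the block
budget has to be computed from the tree being modified rather than
assumed to be zero.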

> > whatever modifications have to be made to the refcount btree.  We
> > don't currently do this, with the result that generic/186 crashes
> > when we need an extent split but not a refcount split because meta_ac
> > never gets allocated.
> 
> 3) in what situation, will this happen? - "we need an extent split but
> not a refcount split".
> Could you please explain more by example?

An extreme example would be a program like this:

- write to block zero
- for i in 1 to 524287,
  - reflink block zero to block $i

When this program terminates, the refcount tree will contain a single
refcount record ($phys_blk, len=1, refcount=524288).  The extent map for
this file, however, will have 524,288 extent records:

($phys_blk, len=1, offset=0, flags=shared)
($phys_blk, len=1, offset=1, flags=shared)
...
($phys_blk, len=1, offset=524287, flags=shared)

There are 524288 records.  A 4k leaf can fit 252 records, so there will
be 2081 leaf blocks.  A 4k node also can fit 252 records, so there will
be 9 node blocks pointing to leaves, and one root block to point to the
first level nodes.  Clearly, the extent tree has split many times across
all the reflink operations.  However, the refcount tree never splits
because there's only one record.
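
The block counts above are just repeated round-up division.  A small
sketch, with blocks_for() as a made-up helper name and 252 records per
4k block taken from the figures above:

```c
#include <assert.h>
#include <stdint.h>

/* Round-up division: blocks needed to hold `items` at `per_block` each. */
static uint64_t blocks_for(uint64_t items, uint64_t per_block)
{
	assert(per_block > 0);
	return (items + per_block - 1) / per_block;
}
```

blocks_for(524288, 252) gives the 2081 leaf blocks, blocks_for(2081, 252)
gives the 9 node blocks that point to them, and blocks_for(9, 252) gives
the single root.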

--D

> 
> Thanks,
> Darwin
> 
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/ocfs2/refcounttree.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> >
> > diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> > index 59be8f4..d92b6c6 100644
> > --- a/fs/ocfs2/refcounttree.c
> > +++ b/fs/ocfs2/refcounttree.c
> > @@ -3698,6 +3698,9 @@ int ocfs2_add_refcount_flag(struct inode *inode,
> >         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> >         struct ocfs2_alloc_context *meta_ac = NULL;
> >
> > +       /* We need to be able to handle at least an extent tree split. */
> > +       ref_blocks = ocfs2_extend_meta_needed(data_et->et_root_el);
> > +
> >         ret = ocfs2_calc_refcount_meta_credits(inode->i_sb,
> >                                                ref_ci, ref_root_bh,
> >                                                p_cluster, num_clusters,
> >
> >
> 
> 
> 
> -- 
> Thanks,
> Darwin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/6] ocfs2: convert inode refcount test to a helper
  2016-11-10  2:14     ` [Ocfs2-devel] " Eric Ren
@ 2016-11-10 17:51       ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-10 17:51 UTC (permalink / raw)
  To: Eric Ren; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Thu, Nov 10, 2016 at 10:14:48AM +0800, Eric Ren wrote:
> On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> >Replace the open-coded inode refcount flag test with a helper function
> >to reduce the potential for bugs.
> Thanks for this series;-) Some comments inline below:
> >
> >Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >---
> >  fs/ocfs2/refcounttree.c |   28 +++++++++++++++-------------
> >  fs/ocfs2/refcounttree.h |    2 ++
> >  2 files changed, 17 insertions(+), 13 deletions(-)
> >
> >
> >diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> >index 1923851..59be8f4 100644
> >--- a/fs/ocfs2/refcounttree.c
> >+++ b/fs/ocfs2/refcounttree.c
> >@@ -48,6 +48,12 @@
> >  #include <linux/mount.h>
> >  #include <linux/posix_acl.h>
> >+/* Does this inode have the reflink flag set? */
> >+bool ocfs2_is_refcount_inode(struct inode *inode)
> Should it be an inline function?

Yes, it can be an inline function.
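
A user-space mock of the header-style version, for illustration.  The
struct, the flag values, and the toy_ names are stand-ins and not the
ocfs2 definitions; the point is only that a static inline helper in a
shared header gives every caller the same test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the values are illustrative, not ocfs2's. */
#define TOY_INLINE_DATA_FL	0x0001u
#define TOY_HAS_REFCOUNT_FL	0x0010u

struct toy_inode_info {
	uint16_t dyn_features;
};

/* Header-style helper: one definition of the test, usable from any file. */
static inline bool toy_is_refcount_inode(const struct toy_inode_info *oi)
{
	/* Conversion to bool collapses any non-zero mask result to true. */
	return oi->dyn_features & TOY_HAS_REFCOUNT_FL;
}
```

Because the return type is bool, callers can negate or compare the
result without worrying about which bit the flag occupies.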

> After applying this patch, looks there are still some places not being
> replaced with this function:
> ---
> fs/ocfs2 # grep -rn "OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL"
> xattr.c:2580:    if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
> xattr.c:3611:    if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL &&
> file.c:1722:    if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
> file.c:2039:        !(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) ||
> refcounttree.c:55:    return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);

Oops.  Yeah, I missed those.  Will send out a v2 patch.

--D

> 
> Eric
> 
> >+{
> >+	return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
> >+}
> >+
> >  struct ocfs2_cow_context {
> >  	struct inode *inode;
> >  	u32 cow_start;
> >@@ -410,7 +416,7 @@ static int ocfs2_get_refcount_block(struct inode *inode, u64 *ref_blkno)
> >  		goto out;
> >  	}
> >-	BUG_ON(!(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> >+	BUG_ON(!ocfs2_is_refcount_inode(inode));
> >  	di = (struct ocfs2_dinode *)di_bh->b_data;
> >  	*ref_blkno = le64_to_cpu(di->i_refcount_loc);
> >@@ -570,7 +576,7 @@ static int ocfs2_create_refcount_tree(struct inode *inode,
> >  	u32 num_got;
> >  	u64 suballoc_loc, first_blkno;
> >-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
> >+	BUG_ON(ocfs2_is_refcount_inode(inode));
> >  	trace_ocfs2_create_refcount_tree(
> >  		(unsigned long long)OCFS2_I(inode)->ip_blkno);
> >@@ -708,7 +714,7 @@ static int ocfs2_set_refcount_tree(struct inode *inode,
> >  	struct ocfs2_refcount_block *rb;
> >  	struct ocfs2_refcount_tree *ref_tree;
> >-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
> >+	BUG_ON(ocfs2_is_refcount_inode(inode));
> >  	ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
> >  				       &ref_tree, &ref_root_bh);
> >@@ -775,7 +781,7 @@ int ocfs2_remove_refcount_tree(struct inode *inode, struct buffer_head *di_bh)
> >  	u64 blk = 0, bg_blkno = 0, ref_blkno = le64_to_cpu(di->i_refcount_loc);
> >  	u16 bit = 0;
> >-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL))
> >+	if (!ocfs2_is_refcount_inode(inode))
> >  		return 0;
> >  	BUG_ON(!ref_blkno);
> >@@ -2299,11 +2305,10 @@ int ocfs2_decrease_refcount(struct inode *inode,
> >  {
> >  	int ret;
> >  	u64 ref_blkno;
> >-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  	struct buffer_head *ref_root_bh = NULL;
> >  	struct ocfs2_refcount_tree *tree;
> >-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> >+	BUG_ON(!ocfs2_is_refcount_inode(inode));
> >  	ret = ocfs2_get_refcount_block(inode, &ref_blkno);
> >  	if (ret) {
> >@@ -2533,7 +2538,6 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
> >  					  int *ref_blocks)
> >  {
> >  	int ret;
> >-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  	struct buffer_head *ref_root_bh = NULL;
> >  	struct ocfs2_refcount_tree *tree;
> >  	u64 start_cpos = ocfs2_blocks_to_clusters(inode->i_sb, phys_blkno);
> >@@ -2544,7 +2548,7 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
> >  		goto out;
> >  	}
> >-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> >+	BUG_ON(!ocfs2_is_refcount_inode(inode));
> >  	ret = ocfs2_get_refcount_tree(OCFS2_SB(inode->i_sb),
> >  				      refcount_loc, &tree);
> >@@ -3412,14 +3416,13 @@ static int ocfs2_refcount_cow_hunk(struct inode *inode,
> >  {
> >  	int ret;
> >  	u32 cow_start = 0, cow_len = 0;
> >-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> >  	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
> >  	struct buffer_head *ref_root_bh = NULL;
> >  	struct ocfs2_refcount_tree *ref_tree;
> >  	struct ocfs2_cow_context *context = NULL;
> >-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> >+	BUG_ON(!ocfs2_is_refcount_inode(inode));
> >  	ret = ocfs2_refcount_cal_cow_clusters(inode, &di->id2.i_list,
> >  					      cpos, write_len, max_cpos,
> >@@ -3629,11 +3632,10 @@ int ocfs2_refcount_cow_xattr(struct inode *inode,
> >  {
> >  	int ret;
> >  	struct ocfs2_xattr_value_root *xv = vb->vb_xv;
> >-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  	struct ocfs2_cow_context *context = NULL;
> >  	u32 cow_start, cow_len;
> >-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
> >+	BUG_ON(!ocfs2_is_refcount_inode(inode));
> >  	ret = ocfs2_refcount_cal_cow_clusters(inode, &xv->xr_list,
> >  					      cpos, write_len, UINT_MAX,
> >@@ -3807,7 +3809,7 @@ static int ocfs2_attach_refcount_tree(struct inode *inode,
> >  	ocfs2_init_dealloc_ctxt(&dealloc);
> >-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL)) {
> >+	if (!ocfs2_is_refcount_inode(inode)) {
> >  		ret = ocfs2_create_refcount_tree(inode, di_bh);
> >  		if (ret) {
> >  			mlog_errno(ret);
> >diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
> >index 6422bbc..553edfb 100644
> >--- a/fs/ocfs2/refcounttree.h
> >+++ b/fs/ocfs2/refcounttree.h
> >@@ -17,6 +17,8 @@
> >  #ifndef OCFS2_REFCOUNTTREE_H
> >  #define OCFS2_REFCOUNTTREE_H
> >+bool ocfs2_is_refcount_inode(struct inode *inode);
> >+
> >  struct ocfs2_refcount_tree {
> >  	struct rb_node rf_node;
> >  	u64 rf_blkno;
> >
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 1/6] ocfs2: convert inode refcount test to a helper
  2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
@ 2016-11-10 17:52     ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-10 17:52 UTC (permalink / raw)
  To: mfasheh, jlbec; +Cc: linux-fsdevel, ocfs2-devel

Replace the open-coded inode refcount flag test with a helper function
to reduce the potential for bugs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/alloc.c        |    3 +--
 fs/ocfs2/file.c         |    7 +++----
 fs/ocfs2/inode.h        |    6 ++++++
 fs/ocfs2/move_extents.c |   10 ++--------
 fs/ocfs2/refcounttree.c |   22 +++++++++-------------
 fs/ocfs2/xattr.c        |    4 ++--
 6 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index f72712f..a0ca49f 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -5713,8 +5713,7 @@ int ocfs2_remove_btree_range(struct inode *inode,
 	struct ocfs2_refcount_tree *ref_tree = NULL;
 
 	if ((flags & OCFS2_EXT_REFCOUNTED) && len) {
-		BUG_ON(!(OCFS2_I(inode)->ip_dyn_features &
-			 OCFS2_HAS_REFCOUNT_FL));
+		BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 		if (!refcount_tree_locked) {
 			ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 000c234..d261f3a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1030,7 +1030,7 @@ int ocfs2_extend_no_holes(struct inode *inode, struct buffer_head *di_bh,
 	 * Only quota files call this without a bh, and they can't be
 	 * refcounted.
 	 */
-	BUG_ON(!di_bh && (oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!di_bh && ocfs2_is_refcount_inode(inode));
 	BUG_ON(!di_bh && !(oi->ip_flags & OCFS2_INODE_SYSTEM_FILE));
 
 	clusters_to_add = ocfs2_clusters_for_bytes(inode->i_sb, new_i_size);
@@ -1719,8 +1719,7 @@ static int ocfs2_remove_inode_range(struct inode *inode,
 	 * within one cluster(means is not exactly aligned to clustersize).
 	 */
 
-	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
-
+	if (ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_cow_file_pos(inode, di_bh, byte_start);
 		if (ret) {
 			mlog_errno(ret);
@@ -2036,7 +2035,7 @@ int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 	struct super_block *sb = inode->i_sb;
 
 	if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)) ||
-	    !(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) ||
+	    !ocfs2_is_refcount_inode(inode) ||
 	    OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)
 		return 0;
 
diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h
index 5af68fc..9b955f7 100644
--- a/fs/ocfs2/inode.h
+++ b/fs/ocfs2/inode.h
@@ -181,4 +181,10 @@ static inline struct ocfs2_inode_info *cache_info_to_inode(struct ocfs2_caching_
 	return container_of(ci, struct ocfs2_inode_info, ip_metadata_cache);
 }
 
+/* Does this inode have the reflink flag set? */
+static inline bool ocfs2_is_refcount_inode(struct inode *inode)
+{
+	return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+}
+
 #endif /* OCFS2_INODE_H */
diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 4e8f32eb..e52a285 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -235,10 +235,7 @@ static int ocfs2_defrag_extent(struct ocfs2_move_extents_context *context,
 	u64 phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
 
 	if ((ext_flags & OCFS2_EXT_REFCOUNTED) && *len) {
-
-		BUG_ON(!(OCFS2_I(inode)->ip_dyn_features &
-			 OCFS2_HAS_REFCOUNT_FL));
-
+		BUG_ON(!ocfs2_is_refcount_inode(inode));
 		BUG_ON(!context->refcount_loc);
 
 		ret = ocfs2_lock_refcount_tree(osb, context->refcount_loc, 1,
@@ -581,10 +578,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
 	phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
 
 	if ((ext_flags & OCFS2_EXT_REFCOUNTED) && len) {
-
-		BUG_ON(!(OCFS2_I(inode)->ip_dyn_features &
-			 OCFS2_HAS_REFCOUNT_FL));
-
+		BUG_ON(!ocfs2_is_refcount_inode(inode));
 		BUG_ON(!context->refcount_loc);
 
 		ret = ocfs2_lock_refcount_tree(osb, context->refcount_loc, 1,
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 1923851..3410eb1 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -410,7 +410,7 @@ static int ocfs2_get_refcount_block(struct inode *inode, u64 *ref_blkno)
 		goto out;
 	}
 
-	BUG_ON(!(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	di = (struct ocfs2_dinode *)di_bh->b_data;
 	*ref_blkno = le64_to_cpu(di->i_refcount_loc);
@@ -570,7 +570,7 @@ static int ocfs2_create_refcount_tree(struct inode *inode,
 	u32 num_got;
 	u64 suballoc_loc, first_blkno;
 
-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+	BUG_ON(ocfs2_is_refcount_inode(inode));
 
 	trace_ocfs2_create_refcount_tree(
 		(unsigned long long)OCFS2_I(inode)->ip_blkno);
@@ -708,7 +708,7 @@ static int ocfs2_set_refcount_tree(struct inode *inode,
 	struct ocfs2_refcount_block *rb;
 	struct ocfs2_refcount_tree *ref_tree;
 
-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+	BUG_ON(ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
 				       &ref_tree, &ref_root_bh);
@@ -775,7 +775,7 @@ int ocfs2_remove_refcount_tree(struct inode *inode, struct buffer_head *di_bh)
 	u64 blk = 0, bg_blkno = 0, ref_blkno = le64_to_cpu(di->i_refcount_loc);
 	u16 bit = 0;
 
-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL))
+	if (!ocfs2_is_refcount_inode(inode))
 		return 0;
 
 	BUG_ON(!ref_blkno);
@@ -2299,11 +2299,10 @@ int ocfs2_decrease_refcount(struct inode *inode,
 {
 	int ret;
 	u64 ref_blkno;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *tree;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_get_refcount_block(inode, &ref_blkno);
 	if (ret) {
@@ -2533,7 +2532,6 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
 					  int *ref_blocks)
 {
 	int ret;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *tree;
 	u64 start_cpos = ocfs2_blocks_to_clusters(inode->i_sb, phys_blkno);
@@ -2544,7 +2542,7 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
 		goto out;
 	}
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_get_refcount_tree(OCFS2_SB(inode->i_sb),
 				      refcount_loc, &tree);
@@ -3412,14 +3410,13 @@ static int ocfs2_refcount_cow_hunk(struct inode *inode,
 {
 	int ret;
 	u32 cow_start = 0, cow_len = 0;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *ref_tree;
 	struct ocfs2_cow_context *context = NULL;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_refcount_cal_cow_clusters(inode, &di->id2.i_list,
 					      cpos, write_len, max_cpos,
@@ -3629,11 +3626,10 @@ int ocfs2_refcount_cow_xattr(struct inode *inode,
 {
 	int ret;
 	struct ocfs2_xattr_value_root *xv = vb->vb_xv;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_cow_context *context = NULL;
 	u32 cow_start, cow_len;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_refcount_cal_cow_clusters(inode, &xv->xr_list,
 					      cpos, write_len, UINT_MAX,
@@ -3807,7 +3803,7 @@ static int ocfs2_attach_refcount_tree(struct inode *inode,
 
 	ocfs2_init_dealloc_ctxt(&dealloc);
 
-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL)) {
+	if (!ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_create_refcount_tree(inode, di_bh);
 		if (ret) {
 			mlog_errno(ret);
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index cb157a3..3c5384d 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -2577,7 +2577,7 @@ int ocfs2_xattr_remove(struct inode *inode, struct buffer_head *di_bh)
 	if (!(oi->ip_dyn_features & OCFS2_HAS_XATTR_FL))
 		return 0;
 
-	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
+	if (ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_lock_refcount_tree(OCFS2_SB(inode->i_sb),
 					       le64_to_cpu(di->i_refcount_loc),
 					       1, &ref_tree, &ref_root_bh);
@@ -3608,7 +3608,7 @@ int ocfs2_xattr_set(struct inode *inode,
 	}
 
 	/* Check whether the value is refcounted and do some preparation. */
-	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL &&
+	if (ocfs2_is_refcount_inode(inode) &&
 	    (!xis.not_found || !xbs.not_found)) {
 		ret = ocfs2_prepare_refcount_xattr(inode, di, &xi,
 						   &xis, &xbs, &ref_tree,

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Ocfs2-devel] [PATCH v2 1/6] ocfs2: convert inode refcount test to a helper
@ 2016-11-10 17:52     ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-10 17:52 UTC (permalink / raw)
  To: mfasheh, jlbec; +Cc: linux-fsdevel, ocfs2-devel

Replace the open-coded inode refcount flag test with a helper function
to reduce the potential for bugs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/alloc.c        |    3 +--
 fs/ocfs2/file.c         |    7 +++----
 fs/ocfs2/inode.h        |    6 ++++++
 fs/ocfs2/move_extents.c |   10 ++--------
 fs/ocfs2/refcounttree.c |   22 +++++++++-------------
 fs/ocfs2/xattr.c        |    4 ++--
 6 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index f72712f..a0ca49f 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -5713,8 +5713,7 @@ int ocfs2_remove_btree_range(struct inode *inode,
 	struct ocfs2_refcount_tree *ref_tree = NULL;
 
 	if ((flags & OCFS2_EXT_REFCOUNTED) && len) {
-		BUG_ON(!(OCFS2_I(inode)->ip_dyn_features &
-			 OCFS2_HAS_REFCOUNT_FL));
+		BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 		if (!refcount_tree_locked) {
 			ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 000c234..d261f3a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1030,7 +1030,7 @@ int ocfs2_extend_no_holes(struct inode *inode, struct buffer_head *di_bh,
 	 * Only quota files call this without a bh, and they can't be
 	 * refcounted.
 	 */
-	BUG_ON(!di_bh && (oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!di_bh && ocfs2_is_refcount_inode(inode));
 	BUG_ON(!di_bh && !(oi->ip_flags & OCFS2_INODE_SYSTEM_FILE));
 
 	clusters_to_add = ocfs2_clusters_for_bytes(inode->i_sb, new_i_size);
@@ -1719,8 +1719,7 @@ static int ocfs2_remove_inode_range(struct inode *inode,
 	 * within one cluster(means is not exactly aligned to clustersize).
 	 */
 
-	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
-
+	if (ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_cow_file_pos(inode, di_bh, byte_start);
 		if (ret) {
 			mlog_errno(ret);
@@ -2036,7 +2035,7 @@ int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 	struct super_block *sb = inode->i_sb;
 
 	if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)) ||
-	    !(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) ||
+	    !ocfs2_is_refcount_inode(inode) ||
 	    OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)
 		return 0;
 
diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h
index 5af68fc..9b955f7 100644
--- a/fs/ocfs2/inode.h
+++ b/fs/ocfs2/inode.h
@@ -181,4 +181,10 @@ static inline struct ocfs2_inode_info *cache_info_to_inode(struct ocfs2_caching_
 	return container_of(ci, struct ocfs2_inode_info, ip_metadata_cache);
 }
 
+/* Does this inode have the reflink flag set? */
+static inline bool ocfs2_is_refcount_inode(struct inode *inode)
+{
+	return (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+}
+
 #endif /* OCFS2_INODE_H */
diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 4e8f32eb..e52a285 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -235,10 +235,7 @@ static int ocfs2_defrag_extent(struct ocfs2_move_extents_context *context,
 	u64 phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
 
 	if ((ext_flags & OCFS2_EXT_REFCOUNTED) && *len) {
-
-		BUG_ON(!(OCFS2_I(inode)->ip_dyn_features &
-			 OCFS2_HAS_REFCOUNT_FL));
-
+		BUG_ON(!ocfs2_is_refcount_inode(inode));
 		BUG_ON(!context->refcount_loc);
 
 		ret = ocfs2_lock_refcount_tree(osb, context->refcount_loc, 1,
@@ -581,10 +578,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
 	phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
 
 	if ((ext_flags & OCFS2_EXT_REFCOUNTED) && len) {
-
-		BUG_ON(!(OCFS2_I(inode)->ip_dyn_features &
-			 OCFS2_HAS_REFCOUNT_FL));
-
+		BUG_ON(!ocfs2_is_refcount_inode(inode));
 		BUG_ON(!context->refcount_loc);
 
 		ret = ocfs2_lock_refcount_tree(osb, context->refcount_loc, 1,
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 1923851..3410eb1 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -410,7 +410,7 @@ static int ocfs2_get_refcount_block(struct inode *inode, u64 *ref_blkno)
 		goto out;
 	}
 
-	BUG_ON(!(OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	di = (struct ocfs2_dinode *)di_bh->b_data;
 	*ref_blkno = le64_to_cpu(di->i_refcount_loc);
@@ -570,7 +570,7 @@ static int ocfs2_create_refcount_tree(struct inode *inode,
 	u32 num_got;
 	u64 suballoc_loc, first_blkno;
 
-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+	BUG_ON(ocfs2_is_refcount_inode(inode));
 
 	trace_ocfs2_create_refcount_tree(
 		(unsigned long long)OCFS2_I(inode)->ip_blkno);
@@ -708,7 +708,7 @@ static int ocfs2_set_refcount_tree(struct inode *inode,
 	struct ocfs2_refcount_block *rb;
 	struct ocfs2_refcount_tree *ref_tree;
 
-	BUG_ON(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL);
+	BUG_ON(ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_lock_refcount_tree(osb, refcount_loc, 1,
 				       &ref_tree, &ref_root_bh);
@@ -775,7 +775,7 @@ int ocfs2_remove_refcount_tree(struct inode *inode, struct buffer_head *di_bh)
 	u64 blk = 0, bg_blkno = 0, ref_blkno = le64_to_cpu(di->i_refcount_loc);
 	u16 bit = 0;
 
-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL))
+	if (!ocfs2_is_refcount_inode(inode))
 		return 0;
 
 	BUG_ON(!ref_blkno);
@@ -2299,11 +2299,10 @@ int ocfs2_decrease_refcount(struct inode *inode,
 {
 	int ret;
 	u64 ref_blkno;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *tree;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_get_refcount_block(inode, &ref_blkno);
 	if (ret) {
@@ -2533,7 +2532,6 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
 					  int *ref_blocks)
 {
 	int ret;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *tree;
 	u64 start_cpos = ocfs2_blocks_to_clusters(inode->i_sb, phys_blkno);
@@ -2544,7 +2542,7 @@ int ocfs2_prepare_refcount_change_for_del(struct inode *inode,
 		goto out;
 	}
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_get_refcount_tree(OCFS2_SB(inode->i_sb),
 				      refcount_loc, &tree);
@@ -3412,14 +3410,13 @@ static int ocfs2_refcount_cow_hunk(struct inode *inode,
 {
 	int ret;
 	u32 cow_start = 0, cow_len = 0;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *ref_tree;
 	struct ocfs2_cow_context *context = NULL;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_refcount_cal_cow_clusters(inode, &di->id2.i_list,
 					      cpos, write_len, max_cpos,
@@ -3629,11 +3626,10 @@ int ocfs2_refcount_cow_xattr(struct inode *inode,
 {
 	int ret;
 	struct ocfs2_xattr_value_root *xv = vb->vb_xv;
-	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_cow_context *context = NULL;
 	u32 cow_start, cow_len;
 
-	BUG_ON(!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL));
+	BUG_ON(!ocfs2_is_refcount_inode(inode));
 
 	ret = ocfs2_refcount_cal_cow_clusters(inode, &xv->xr_list,
 					      cpos, write_len, UINT_MAX,
@@ -3807,7 +3803,7 @@ static int ocfs2_attach_refcount_tree(struct inode *inode,
 
 	ocfs2_init_dealloc_ctxt(&dealloc);
 
-	if (!(oi->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL)) {
+	if (!ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_create_refcount_tree(inode, di_bh);
 		if (ret) {
 			mlog_errno(ret);
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index cb157a3..3c5384d 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -2577,7 +2577,7 @@ int ocfs2_xattr_remove(struct inode *inode, struct buffer_head *di_bh)
 	if (!(oi->ip_dyn_features & OCFS2_HAS_XATTR_FL))
 		return 0;
 
-	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL) {
+	if (ocfs2_is_refcount_inode(inode)) {
 		ret = ocfs2_lock_refcount_tree(OCFS2_SB(inode->i_sb),
 					       le64_to_cpu(di->i_refcount_loc),
 					       1, &ref_tree, &ref_root_bh);
@@ -3608,7 +3608,7 @@ int ocfs2_xattr_set(struct inode *inode,
 	}
 
 	/* Check whether the value is refcounted and do some preparation. */
-	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_REFCOUNT_FL &&
+	if (ocfs2_is_refcount_inode(inode) &&
 	    (!xis.not_found || !xbs.not_found)) {
 		ret = ocfs2_prepare_refcount_xattr(inode, di, &xi,
 						   &xis, &xbs, &ref_tree,

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag
  2016-11-10 17:11       ` Darrick J. Wong
@ 2016-11-11  3:00         ` Darwin
  -1 siblings, 0 replies; 42+ messages in thread
From: Darwin @ 2016-11-11  3:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

Hi Darrick,

On Fri, Nov 11, 2016 at 1:11 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
[snip]
> The VFS reflink interface (FICLONERANGE) and dedupe interface (FIDEDUPERANGE)
> allow callers to specify the range of bytes on which to operate, which means
> that we can share arbitrary parts of two files.  For example, let's say you
> have one regular extent in a file's block map:
>
> RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR (regular extent)
>
> Now ask reflink to share the middle of that extent with some other file.
> ocfs2's extent mapping record can store a flag indicating that the
> extent could be shared, which means that the one record now splits into
> three:
>
> RRRRRRRRRRRssssssRRRRRRRRRRRRR (regular, shared, regular)
>
> This scenario happens if you run duperemove against an ocfs2 filesystem
> to deduplicate file data, since dedupe wants the filesystem to be able
> to share arbitrary blocks of files.
>
> You're correct that cp --reflink only ever deals with entire extents
> because it calls FICLONERANGE with both file offsets zero and the length
> set to the length of the source file.  The important thing to remember
> is that we are not limited to sharing entire files, even if the
> userspace utilities don't take advantage of it.

Thanks for your time! I thought this patch addressed an existing ocfs2 issue
until I reached patch [6/6].
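
The three-way split in the quoted explanation above can be sketched in
userspace C. This is only an illustration under stated assumptions: struct
extent and share_subrange() are hypothetical names, and the record layout is
simplified rather than the on-disk ocfs2 format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified extent record; not the on-disk ocfs2 layout. */
struct extent {
	uint32_t offset;	/* first logical cluster covered */
	uint32_t len;		/* number of clusters covered */
	int shared;		/* nonzero if flagged as possibly shared */
};

/*
 * Mark [start, start + len) inside extent *in as shared and write the
 * resulting records to out[], which must have room for three entries.
 * Returns how many records were written (1, 2 or 3).
 */
static int share_subrange(const struct extent *in, uint32_t start,
			  uint32_t len, struct extent out[3])
{
	uint32_t end = start + len;
	uint32_t in_end = in->offset + in->len;
	int n = 0;

	/* The shared range must be non-empty and lie inside the extent. */
	assert(len > 0 && start >= in->offset && end <= in_end);

	if (start > in->offset)		/* regular piece on the left */
		out[n++] = (struct extent){ in->offset, start - in->offset,
					    in->shared };
	out[n++] = (struct extent){ start, len, 1 };	/* shared middle */
	if (end < in_end)		/* regular piece on the right */
		out[n++] = (struct extent){ end, in_end - end, in->shared };
	return n;
}
```

Sharing a range in the middle of one extent yields three records, while
sharing the whole extent yields one. That is why a whole-file clone never
needs the extra records that a sub-range clone or dedupe can require.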

>
> (FWIW the duperemove program does take advantage of it.)
>
>> 2) what do you mean by "*full* extent btree"?
>
> (Slight correction to that -- I should have said "full extent tree split".)
>
> Record insertion operations on a tree structure all follow the same
> basic strategy -- search from the root towards the records in the leaf
> blocks until we find the place where the record would be, memmove() all
> the records following that spot up by one index, copy the record data
> into the newly vacated slot, and (if necessary) walk back up the tree to
> update the interior node pointers.
>
> If the desired leaf is already full, however, there is a problem -- we
> have to split the leaf into two half-full leaf blocks before we can
> insert the record.  We must also add a pointer to the new leaf into the
> next level up in the tree.  If that interior node is also full, we split
> the interior node prior to adding the new leaf block pointer.  We must
> then add a pointer to the new interior node into the next level up in
> the tree, and so on until we reach the root.
>
> In short, if we want to add a record to a tree whose blocks are
> completely full, we end up splitting blocks all the way up the tree.
>

Oh yes, thanks for walking me through the basic tree operation, record insertion ;-)
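
The insertion strategy quoted above can be sketched for a single leaf level.
This is a minimal userspace illustration, not ocfs2 code: struct leaf,
LEAF_CAP, leaf_insert() and leaf_insert_split() are hypothetical names, and
the parent-pointer update is left to the caller.

```c
#include <assert.h>
#include <string.h>

#define LEAF_CAP 4	/* tiny capacity so that splits are easy to trigger */

/* Hypothetical leaf block holding sorted integer keys. */
struct leaf {
	int count;
	int keys[LEAF_CAP];
};

/* Insert key into a leaf that still has room, keeping keys sorted. */
static void leaf_insert(struct leaf *l, int key)
{
	int i = 0;

	assert(l->count < LEAF_CAP);
	while (i < l->count && l->keys[i] < key)
		i++;
	/* Shift the tail up by one slot to vacate position i. */
	memmove(&l->keys[i + 1], &l->keys[i],
		(size_t)(l->count - i) * sizeof(l->keys[0]));
	l->keys[i] = key;
	l->count++;
}

/*
 * Insert key, splitting a full leaf into two half-full leaves first.
 * Returns 1 if *right was populated, in which case the caller would have
 * to add a pointer to it in the parent (possibly splitting the parent too).
 */
static int leaf_insert_split(struct leaf *left, struct leaf *right, int key)
{
	int half = LEAF_CAP / 2;
	int split = 0;

	if (left->count == LEAF_CAP) {
		/* Move the upper half of the records into the new leaf. */
		memcpy(right->keys, &left->keys[half],
		       (size_t)(LEAF_CAP - half) * sizeof(left->keys[0]));
		right->count = LEAF_CAP - half;
		left->count = half;
		split = 1;
	}
	if (split && key >= right->keys[0])
		leaf_insert(right, key);
	else
		leaf_insert(left, key);
	return split;
}
```

A return value of 1 is the point at which the same procedure would repeat one
level up. If every block on the path to the root is full, the split
propagates all the way, which is the worst case the patch budgets for.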

>>
>> 3) In what situation will this happen: "we need an extent split but
>> not a refcount split"?
>> Could you please explain with an example?
>
> An extreme example would be a program like this:
>
> - write to block zero
> - for i in 1 to 524287,
>   - reflink block zero to block $i
>
> When this program terminates, the refcount tree will contain a single
> refcount record ($phys_blk, len=1, refcount=524288).  The extent map for
> this file, however, will have 524,288 extent records:
>
> ($phys_blk, len=1, offset=0, flags=shared)
> ($phys_blk, len=1, offset=1, flags=shared)
> ...
> ($phys_blk, len=1, offset=524287, flags=shared)
>
> There are 524288 records.  A 4k leaf can fit 252 records, so there will
> be 2081 leaf blocks.  A 4k node also can fit 252 records, so there will
> be 9 node blocks pointing to leaves, and one root block to point to the
> first level nodes.  Clearly, the extent tree has split many times across
> all the reflink operations.  However, the refcount tree never splits
> because there's only one record.

Excellent explanation! Now I understand that different virtual blocks of a
file can share the same physical block.
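
The block counts in the quoted example follow from repeated round-up
division. A small sketch of that arithmetic, assuming (as the quoted text
does) that every level fits 252 records per 4k block; blocks_needed() and
tree_levels() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Number of blocks needed to hold n records at per_block records each. */
static uint64_t blocks_needed(uint64_t n, uint64_t per_block)
{
	assert(per_block > 0);
	return (n + per_block - 1) / per_block;	/* round up */
}

/*
 * Fill levels[] with the block count of each tree level, starting with the
 * leaves, until a single block (the root) is reached or max_levels entries
 * have been written.  Returns the number of levels written.
 */
static int tree_levels(uint64_t records, uint64_t per_block,
		       uint64_t levels[], int max_levels)
{
	uint64_t cur = records;
	int n = 0;

	assert(max_levels > 0);
	do {
		cur = blocks_needed(cur, per_block);
		levels[n++] = cur;
	} while (cur > 1 && n < max_levels);
	return n;
}
```

For 524288 records at 252 per block this gives 2081 leaf blocks, 9 node
blocks and 1 root block, matching the figures in the quoted message. The
refcount tree, by contrast, holds a single record and never grows.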

Thanks a lot!
Darwin

>
> --D
>
>>
>> Thanks,
>> Darwin
>>
>> >
>> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
>> > ---
>> >  fs/ocfs2/refcounttree.c |    3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> >
>> > diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
>> > index 59be8f4..d92b6c6 100644
>> > --- a/fs/ocfs2/refcounttree.c
>> > +++ b/fs/ocfs2/refcounttree.c
>> > @@ -3698,6 +3698,9 @@ int ocfs2_add_refcount_flag(struct inode *inode,
>> >         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>> >         struct ocfs2_alloc_context *meta_ac = NULL;
>> >
>> > +       /* We need to be able to handle at least an extent tree split. */
>> > +       ref_blocks = ocfs2_extend_meta_needed(data_et->et_root_el);
>> > +
>> >         ret = ocfs2_calc_refcount_meta_credits(inode->i_sb,
>> >                                                ref_ci, ref_root_bh,
>> >                                                p_cluster, num_clusters,
>> >
>> >
>>
>>
>>
>> --
>> Thanks,
>> Darwin



-- 
Thanks,
Darwin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/6] ocfs2: wire up {clone,copy,dedupe}_range
  2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
@ 2016-11-11  3:15   ` Eric Ren
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Ren @ 2016-11-11  3:15 UTC (permalink / raw)
  To: Darrick J. Wong, mfasheh, jlbec; +Cc: linux-fsdevel, ocfs2-devel

Hi,

On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> Hi all,
>
> These patches wire up the existing ocfs2 reflinking capabilities to
> the new(ish) VFS {copy,clone,dedupe}_range interface.  The first few
> patches clean up some minor bugs that I found; the last kernel patch
> contains the new code.
>
> A few minor fixes to xfstests are needed to make more of the tests
> run.  I'll tack that patch on the end.

FYI, the reflink test cases from ocfs2-test all passed with your patches, on
both single-node and multi-node setups. At least that shows no obvious
regression so far ;-)

Eric
>
> --D
>
> [1] https://github.com/djwong/linux/tree/ocfs2-vfs-reflink
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
@ 2016-11-11  5:49     ` Eric Ren
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Ren @ 2016-11-11  5:49 UTC (permalink / raw)
  To: Darrick J. Wong, mfasheh, jlbec; +Cc: linux-fsdevel, ocfs2-devel

Hi,

A few issues that stood out to me:

On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> Connect the new VFS clone_range, copy_range, and dedupe_range features
> to the existing reflink capability of ocfs2.  Compared to the existing
> ocfs2 reflink ioctl, we have to do things a little differently to support
> the VFS semantics (we can clone subranges of a file but we don't clone
> xattrs), but the VFS ioctls are more broadly supported.

How can I test the new ocfs2 reflink support (with this patch) manually? Which
commands should I use to exercise the clone_range, copy_range, and
dedupe_range paths?
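
One way to exercise the clone path by hand is a small userspace program that
issues the FICLONERANGE ioctl directly. The following is a minimal sketch,
not part of the patch; clone_range() is a hypothetical wrapper name, while
FICLONERANGE and struct file_clone_range come from <linux/fs.h>.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

/*
 * Ask the filesystem to share len bytes of src_fd starting at src_off with
 * dst_fd at dst_off.  Returns 0 on success or a negative errno value.
 */
static int clone_range(int src_fd, uint64_t src_off,
		       int dst_fd, uint64_t dst_off, uint64_t len)
{
	struct file_clone_range arg = {
		.src_fd = src_fd,
		.src_offset = src_off,
		.src_length = len,	/* 0 means "to the end of the source" */
		.dest_offset = dst_off,
	};

	/* The ioctl is issued on the destination file descriptor. */
	if (ioctl(dst_fd, FICLONERANGE, &arg) < 0)
		return -errno;
	return 0;
}
```

The source must be open for reading and the destination for writing, and both
files must live on the same filesystem. A filesystem without reflink support
is expected to fail the call with an error such as -EOPNOTSUPP, so the return
value also serves as a quick check that the new code path is wired up.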

>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/ocfs2/file.c         |   62 ++++-
>   fs/ocfs2/file.h         |    3
>   fs/ocfs2/refcounttree.c |  619 +++++++++++++++++++++++++++++++++++++++++++++++
>   fs/ocfs2/refcounttree.h |    7 +
>   4 files changed, 688 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 000c234..d5a022d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
>   	*done = ret;
>   }
>   
> -static int ocfs2_remove_inode_range(struct inode *inode,
> -				    struct buffer_head *di_bh, u64 byte_start,
> -				    u64 byte_len)
> +int ocfs2_remove_inode_range(struct inode *inode,
> +			     struct buffer_head *di_bh, u64 byte_start,
> +			     u64 byte_len)
>   {
>   	int ret = 0, flags = 0, done = 0, i;
>   	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
> @@ -2440,6 +2440,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
>   	return offset;
>   }
>   
> +static ssize_t ocfs2_file_copy_range(struct file *file_in,
> +				     loff_t pos_in,
> +				     struct file *file_out,
> +				     loff_t pos_out,
> +				     size_t len,
> +				     unsigned int flags)
> +{
> +	int error;
> +
> +	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> +					  len, false);
> +	if (error)
> +		return error;
> +	return len;
> +}
> +
> +static int ocfs2_file_clone_range(struct file *file_in,
> +				  loff_t pos_in,
> +				  struct file *file_out,
> +				  loff_t pos_out,
> +				  u64 len)
> +{
> +	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> +					 len, false);
> +}
> +
> +#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
> +static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
> +				       u64 loff,
> +				       u64 len,
> +				       struct file *dst_file,
> +				       u64 dst_loff)
> +{
> +	int error;
> +
> +	/*
> +	 * Limit the total length we will dedupe for each operation.
> +	 * This is intended to bound the total time spent in this
> +	 * ioctl to something sane.
> +	 */
> +	if (len > OCFS2_MAX_DEDUPE_LEN)
> +		len = OCFS2_MAX_DEDUPE_LEN;
> +
> +	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
> +					  len, true);
> +	if (error)
> +		return error;
> +	return len;
> +}
> +
>   const struct inode_operations ocfs2_file_iops = {
>   	.setattr	= ocfs2_setattr,
>   	.getattr	= ocfs2_getattr,
> @@ -2479,6 +2529,9 @@ const struct file_operations ocfs2_fops = {
>   	.splice_read	= generic_file_splice_read,
>   	.splice_write	= iter_file_splice_write,
>   	.fallocate	= ocfs2_fallocate,
> +	.copy_file_range = ocfs2_file_copy_range,
> +	.clone_file_range = ocfs2_file_clone_range,
> +	.dedupe_file_range = ocfs2_file_dedupe_range,
>   };
>   
>   const struct file_operations ocfs2_dops = {
> @@ -2524,6 +2577,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
>   	.splice_read	= generic_file_splice_read,
>   	.splice_write	= iter_file_splice_write,
>   	.fallocate	= ocfs2_fallocate,
> +	.copy_file_range = ocfs2_file_copy_range,
> +	.clone_file_range = ocfs2_file_clone_range,
> +	.dedupe_file_range = ocfs2_file_dedupe_range,
>   };
>   
>   const struct file_operations ocfs2_dops_no_plocks = {
> diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
> index e8c62f2..897fd9a 100644
> --- a/fs/ocfs2/file.h
> +++ b/fs/ocfs2/file.h
> @@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
>   
>   int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
>   				   size_t count);
> +int ocfs2_remove_inode_range(struct inode *inode,
> +			     struct buffer_head *di_bh, u64 byte_start,
> +			     u64 byte_len);
>   #endif /* OCFS2_FILE_H */
> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> index d92b6c6..3e2198c 100644
> --- a/fs/ocfs2/refcounttree.c
> +++ b/fs/ocfs2/refcounttree.c
> @@ -34,6 +34,7 @@
>   #include "xattr.h"
>   #include "namei.h"
>   #include "ocfs2_trace.h"
> +#include "file.h"
>   
>   #include <linux/bio.h>
>   #include <linux/blkdev.h>
> @@ -4447,3 +4448,621 @@ int ocfs2_reflink_ioctl(struct inode *inode,
>   
>   	return error;
>   }
> +
> +/* Update destination inode size, if necessary. */
> +static int ocfs2_reflink_update_dest(struct inode *dest,
> +				     struct buffer_head *d_bh,
> +				     loff_t newlen)
> +{
> +	handle_t *handle;
> +	struct ocfs2_dinode *di = (struct ocfs2_dinode *)d_bh->b_data;
> +	int ret;
> +
> +	if (newlen <= i_size_read(dest))
> +		return 0;
> +
> +	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
> +				   OCFS2_INODE_UPDATE_CREDITS);
> +	if (IS_ERR(handle)) {
> +		ret = PTR_ERR(handle);
> +		mlog_errno(ret);
> +		return ret;
> +	}
> +
> +	ret = ocfs2_journal_access_di(handle, INODE_CACHE(dest), d_bh,
> +				      OCFS2_JOURNAL_ACCESS_WRITE);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_commit;
> +	}
> +
> +	spin_lock(&OCFS2_I(dest)->ip_lock);
> +	if (newlen > i_size_read(dest)) {
> +		i_size_write(dest, newlen);
> +		di->i_size = newlen;

This needs an endian conversion, since the on-disk field is little-endian:

	di->i_size = cpu_to_le64(newlen);

> +	}
> +	spin_unlock(&OCFS2_I(dest)->ip_lock);
> +

Should ocfs2_update_inode_fsync_trans() be called here? It looks like you
introduced that function to improve efficiency. I just want to jog your
memory about it, though I don't know the details of why it is needed.

Eric

> +	ocfs2_journal_dirty(handle, d_bh);
> +
> +out_commit:
> +	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
> +	return ret;
> +}
> +
> +/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
> +static int ocfs2_reflink_remap_extent(struct inode *s_inode,
> +				      struct buffer_head *s_bh,
> +				      loff_t pos_in,
> +				      struct inode *t_inode,
> +				      struct buffer_head *t_bh,
> +				      loff_t pos_out,
> +				      loff_t len,
> +				      struct ocfs2_cached_dealloc_ctxt *dealloc)
> +{
> +	struct ocfs2_extent_tree s_et;
> +	struct ocfs2_extent_tree t_et;
> +	struct ocfs2_dinode *dis;
> +	struct buffer_head *ref_root_bh = NULL;
> +	struct ocfs2_refcount_tree *ref_tree;
> +	struct ocfs2_super *osb;
> +	loff_t pstart, plen;
> +	u32 p_cluster, num_clusters, slast, spos, tpos;
> +	unsigned int ext_flags;
> +	int ret = 0;
> +
> +	osb = OCFS2_SB(s_inode->i_sb);
> +	dis = (struct ocfs2_dinode *)s_bh->b_data;
> +	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
> +	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
> +
> +	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
> +	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
> +	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
> +
> +	while (spos < slast) {
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out;
> +		}
> +
> +		/* Look up the extent. */
> +		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
> +					 &num_clusters, &ext_flags);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +
> +		num_clusters = min_t(u32, num_clusters, slast - spos);
> +
> +		/* Punch out the dest range. */
> +		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
> +		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
> +		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +
> +		if (p_cluster == 0)
> +			goto next_loop;
> +
> +		/* Lock the refcount btree... */
> +		ret = ocfs2_lock_refcount_tree(osb,
> +					       le64_to_cpu(dis->i_refcount_loc),
> +					       1, &ref_tree, &ref_root_bh);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +
> +		/* Mark s_inode's extent as refcounted. */
> +		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
> +			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
> +						      &ref_tree->rf_ci,
> +						      ref_root_bh, spos,
> +						      p_cluster, num_clusters,
> +						      dealloc, NULL);
> +			if (ret) {
> +				mlog_errno(ret);
> +				goto out_unlock_refcount;
> +			}
> +		}
> +
> +		/* Map in the new extent. */
> +		ext_flags |= OCFS2_EXT_REFCOUNTED;
> +		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
> +						  &ref_tree->rf_ci,
> +						  ref_root_bh,
> +						  tpos, p_cluster,
> +						  num_clusters,
> +						  ext_flags,
> +						  dealloc);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out_unlock_refcount;
> +		}
> +
> +		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> +		brelse(ref_root_bh);
> +next_loop:
> +		spos += num_clusters;
> +		tpos += num_clusters;
> +	}
> +
> +out:
> +	return ret;
> +out_unlock_refcount:
> +	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> +	brelse(ref_root_bh);
> +	return ret;
> +}
> +
> +/* Set up refcount tree and remap s_inode to t_inode. */
> +static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
> +				      struct buffer_head *s_bh,
> +				      loff_t pos_in,
> +				      struct inode *t_inode,
> +				      struct buffer_head *t_bh,
> +				      loff_t pos_out,
> +				      loff_t len)
> +{
> +	struct ocfs2_cached_dealloc_ctxt dealloc;
> +	struct ocfs2_super *osb;
> +	struct ocfs2_dinode *dis;
> +	struct ocfs2_dinode *dit;
> +	int ret;
> +
> +	osb = OCFS2_SB(s_inode->i_sb);
> +	dis = (struct ocfs2_dinode *)s_bh->b_data;
> +	dit = (struct ocfs2_dinode *)t_bh->b_data;
> +	ocfs2_init_dealloc_ctxt(&dealloc);
> +
> +	/*
> +	 * If both inodes belong to two different refcount groups then
> +	 * forget it because we don't know how (or want) to go merging
> +	 * refcount trees.
> +	 */
> +	ret = -EOPNOTSUPP;
> +	if (ocfs2_is_refcount_inode(s_inode) &&
> +	    ocfs2_is_refcount_inode(t_inode) &&
> +	    le64_to_cpu(dis->i_refcount_loc) !=
> +	    le64_to_cpu(dit->i_refcount_loc))
> +		goto out;
> +
> +	/* Neither inode has a refcount tree.  Add one to s_inode. */
> +	if (!ocfs2_is_refcount_inode(s_inode) &&
> +	    !ocfs2_is_refcount_inode(t_inode)) {
> +		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +	}
> +
> +	/* Ensure that both inodes end up with the same refcount tree. */
> +	if (!ocfs2_is_refcount_inode(s_inode)) {
> +		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
> +					      le64_to_cpu(dit->i_refcount_loc));
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +	}
> +	if (!ocfs2_is_refcount_inode(t_inode)) {
> +		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
> +					      le64_to_cpu(dis->i_refcount_loc));
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +	}
> +
> +	/*
> +	 * If we're reflinking the entire file and the source is inline
> +	 * data, just copy the contents.
> +	 */
> +	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
> +	    i_size_read(t_inode) <= len &&
> +	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
> +		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
> +		if (ret)
> +			mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
> +					 pos_out, len, &dealloc);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +out:
> +	if (ocfs2_dealloc_has_cluster(&dealloc)) {
> +		ocfs2_schedule_truncate_log_flush(osb, 1);
> +		ocfs2_run_deallocs(osb, &dealloc);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Lock an inode and grab a bh pointing to the inode. */
> +static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
> +				     struct buffer_head **bh1,
> +				     struct inode *t_inode,
> +				     struct buffer_head **bh2)
> +{
> +	struct inode *inode1;
> +	struct inode *inode2;
> +	struct ocfs2_inode_info *oi1;
> +	struct ocfs2_inode_info *oi2;
> +	bool same_inode = (s_inode == t_inode);
> +	int status;
> +
> +	/* First grab the VFS and rw locks. */
> +	inode1 = s_inode;
> +	inode2 = t_inode;
> +	if (inode1->i_ino > inode2->i_ino)
> +		swap(inode1, inode2);
> +
> +	inode_lock(inode1);
> +	status = ocfs2_rw_lock(inode1, 1);
> +	if (status) {
> +		mlog_errno(status);
> +		goto out_i1;
> +	}
> +	if (!same_inode) {
> +		inode_lock_nested(inode2, I_MUTEX_CHILD);
> +		status = ocfs2_rw_lock(inode2, 1);
> +		if (status) {
> +			mlog_errno(status);
> +			goto out_i2;
> +		}
> +	}
> +
> +	/* Now go for the cluster locks */
> +	oi1 = OCFS2_I(inode1);
> +	oi2 = OCFS2_I(inode2);
> +
> +	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
> +				(unsigned long long)oi2->ip_blkno);
> +
> +	if (*bh1)
> +		*bh1 = NULL;
> +	if (*bh2)
> +		*bh2 = NULL;
> +
> +	/* We always want to lock the one with the lower lockid first. */
> +	if (oi1->ip_blkno > oi2->ip_blkno)
> +		mlog_errno(-ENOLCK);
> +
> +	/* lock id1 */
> +	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
> +	if (status < 0) {
> +		if (status != -ENOENT)
> +			mlog_errno(status);
> +		goto out_rw2;
> +	}
> +
> +	/* lock id2 */
> +	if (!same_inode) {
> +		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
> +						 OI_LS_REFLINK_TARGET);
> +		if (status < 0) {
> +			if (status != -ENOENT)
> +				mlog_errno(status);
> +			goto out_cl1;
> +		}
> +	} else
> +		*bh2 = *bh1;
> +
> +	trace_ocfs2_double_lock_end(
> +			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
> +			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
> +
> +	return 0;
> +
> +out_cl1:
> +	ocfs2_inode_unlock(inode1, 1);
> +	brelse(*bh1);
> +	*bh1 = NULL;
> +out_rw2:
> +	ocfs2_rw_unlock(inode2, 1);
> +out_i2:
> +	inode_unlock(inode2);
> +	ocfs2_rw_unlock(inode1, 1);
> +out_i1:
> +	inode_unlock(inode1);
> +	return status;
> +}
> +
> +/* Unlock both inodes and release buffers. */
> +static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
> +					struct buffer_head *s_bh,
> +					struct inode *t_inode,
> +					struct buffer_head *t_bh)
> +{
> +	ocfs2_inode_unlock(s_inode, 1);
> +	ocfs2_rw_unlock(s_inode, 1);
> +	inode_unlock(s_inode);
> +	brelse(s_bh);
> +
> +	if (s_inode == t_inode)
> +		return;
> +
> +	ocfs2_inode_unlock(t_inode, 1);
> +	ocfs2_rw_unlock(t_inode, 1);
> +	inode_unlock(t_inode);
> +	brelse(t_bh);
> +}
> +
> +/*
> + * Read a page's worth of file data into the page cache.  Return the page
> + * locked.
> + */
> +static struct page *ocfs2_reflink_get_page(struct inode *inode,
> +					   loff_t offset)
> +{
> +	struct address_space *mapping;
> +	struct page *page;
> +	pgoff_t n;
> +
> +	n = offset >> PAGE_SHIFT;
> +	mapping = inode->i_mapping;
> +	page = read_mapping_page(mapping, n, NULL);
> +	if (IS_ERR(page))
> +		return page;
> +	if (!PageUptodate(page)) {
> +		put_page(page);
> +		return ERR_PTR(-EIO);
> +	}
> +	lock_page(page);
> +	return page;
> +}
> +
> +/*
> + * Compare extents of two files to see if they are the same.
> + */
> +static int ocfs2_reflink_compare_extents(struct inode *src,
> +					 loff_t srcoff,
> +					 struct inode *dest,
> +					 loff_t destoff,
> +					 loff_t len,
> +					 bool *is_same)
> +{
> +	loff_t src_poff;
> +	loff_t dest_poff;
> +	void *src_addr;
> +	void *dest_addr;
> +	struct page *src_page;
> +	struct page *dest_page;
> +	loff_t cmp_len;
> +	bool same;
> +	int error;
> +
> +	error = -EINVAL;
> +	same = true;
> +	while (len) {
> +		src_poff = srcoff & (PAGE_SIZE - 1);
> +		dest_poff = destoff & (PAGE_SIZE - 1);
> +		cmp_len = min(PAGE_SIZE - src_poff,
> +			      PAGE_SIZE - dest_poff);
> +		cmp_len = min(cmp_len, len);
> +		if (cmp_len <= 0) {
> +			mlog_errno(-EUCLEAN);
> +			goto out_error;
> +		}
> +
> +		src_page = ocfs2_reflink_get_page(src, srcoff);
> +		if (IS_ERR(src_page)) {
> +			error = PTR_ERR(src_page);
> +			goto out_error;
> +		}
> +		dest_page = ocfs2_reflink_get_page(dest, destoff);
> +		if (IS_ERR(dest_page)) {
> +			error = PTR_ERR(dest_page);
> +			unlock_page(src_page);
> +			put_page(src_page);
> +			goto out_error;
> +		}
> +		src_addr = kmap_atomic(src_page);
> +		dest_addr = kmap_atomic(dest_page);
> +
> +		flush_dcache_page(src_page);
> +		flush_dcache_page(dest_page);
> +
> +		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> +			same = false;
> +
> +		kunmap_atomic(dest_addr);
> +		kunmap_atomic(src_addr);
> +		unlock_page(dest_page);
> +		unlock_page(src_page);
> +		put_page(dest_page);
> +		put_page(src_page);
> +
> +		if (!same)
> +			break;
> +
> +		srcoff += cmp_len;
> +		destoff += cmp_len;
> +		len -= cmp_len;
> +	}
> +
> +	*is_same = same;
> +	return 0;
> +
> +out_error:
> +	return error;
> +}
> +
> +/* Link a range of blocks from one file to another. */
> +int ocfs2_reflink_remap_range(struct file *file_in,
> +			      loff_t pos_in,
> +			      struct file *file_out,
> +			      loff_t pos_out,
> +			      u64 len,
> +			      bool is_dedupe)
> +{
> +	struct inode *inode_in = file_inode(file_in);
> +	struct inode *inode_out = file_inode(file_out);
> +	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
> +	struct buffer_head *in_bh = NULL, *out_bh = NULL;
> +	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
> +	bool same_inode = (inode_in == inode_out);
> +	bool is_same = false;
> +	loff_t isize;
> +	ssize_t ret;
> +	loff_t blen;
> +
> +	if (!ocfs2_refcount_tree(osb))
> +		return -EOPNOTSUPP;
> +	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> +		return -EROFS;
> +
> +	/* Lock both files against IO */
> +	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
> +	if (ret)
> +		return ret;
> +
> +	ret = -EINVAL;
> +	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
> +	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
> +		goto out_unlock;
> +
> +	/* Don't touch certain kinds of inodes */
> +	ret = -EPERM;
> +	if (IS_IMMUTABLE(inode_out))
> +		goto out_unlock;
> +
> +	ret = -ETXTBSY;
> +	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> +		goto out_unlock;
> +
> +	/* Don't reflink dirs, pipes, sockets... */
> +	ret = -EISDIR;
> +	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> +		goto out_unlock;
> +	ret = -EINVAL;
> +	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
> +		goto out_unlock;
> +	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +		goto out_unlock;
> +
> +	/* Are we going all the way to the end? */
> +	isize = i_size_read(inode_in);
> +	if (isize == 0) {
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +
> +	if (len == 0)
> +		len = isize - pos_in;
> +
> +	/* Ensure offsets don't wrap and the input is inside i_size */
> +	if (pos_in + len < pos_in || pos_out + len < pos_out ||
> +	    pos_in + len > isize)
> +		goto out_unlock;
> +
> +	/* Don't allow dedupe past EOF in the dest file */
> +	if (is_dedupe) {
> +		loff_t	disize;
> +
> +		disize = i_size_read(inode_out);
> +		if (pos_out >= disize || pos_out + len > disize)
> +			goto out_unlock;
> +	}
> +
> +	/* If we're linking to EOF, continue to the block boundary. */
> +	if (pos_in + len == isize)
> +		blen = ALIGN(isize, bs) - pos_in;
> +	else
> +		blen = len;
> +
> +	/* Only reflink if we're aligned to block boundaries */
> +	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
> +	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
> +		goto out_unlock;
> +
> +	/* Don't allow overlapped reflink within the same file */
> +	if (same_inode) {
> +		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
> +			goto out_unlock;
> +	}
> +
> +	/* Wait for the completion of any pending IOs on both files */
> +	inode_dio_wait(inode_in);
> +	if (!same_inode)
> +		inode_dio_wait(inode_out);
> +
> +	ret = filemap_write_and_wait_range(inode_in->i_mapping,
> +			pos_in, pos_in + len - 1);
> +	if (ret)
> +		goto out_unlock;
> +
> +	ret = filemap_write_and_wait_range(inode_out->i_mapping,
> +			pos_out, pos_out + len - 1);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/*
> +	 * Check that the extents are the same.
> +	 */
> +	if (is_dedupe) {
> +		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
> +						    inode_out, pos_out,
> +						    len, &is_same);
> +		if (ret)
> +			goto out_unlock;
> +		if (!is_same) {
> +			ret = -EBADE;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/* Lock out changes to the allocation maps */
> +	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> +	if (!same_inode)
> +		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
> +				  SINGLE_DEPTH_NESTING);
> +
> +	/*
> +	 * Invalidate the page cache so that we can clear any CoW mappings
> +	 * in the destination file.
> +	 */
> +	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> +				   PAGE_ALIGN(pos_out + len) - 1);
> +
> +	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
> +					 out_bh, pos_out, len);
> +
> +	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> +	if (!same_inode)
> +		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	/*
> +	 * Empty the extent map so that we may get the right extent
> +	 * record from the disk.
> +	 */
> +	ocfs2_extent_map_trunc(inode_in, 0);
> +	ocfs2_extent_map_trunc(inode_out, 0);
> +
> +	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> +	return 0;
> +
> +out_unlock:
> +	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> +	return ret;
> +}
> diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
> index 553edfb..c023e88 100644
> --- a/fs/ocfs2/refcounttree.h
> +++ b/fs/ocfs2/refcounttree.h
> @@ -117,4 +117,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
>   			const char __user *oldname,
>   			const char __user *newname,
>   			bool preserve);
> +int ocfs2_reflink_remap_range(struct file *file_in,
> +			      loff_t pos_in,
> +			      struct file *file_out,
> +			      loff_t pos_out,
> +			      u64 len,
> +			      bool is_dedupe);
> +
>   #endif /* OCFS2_REFCOUNTTREE_H */
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
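
The offset checks in ocfs2_reflink_remap_range() quoted above (wrap and
i_size check, rounding up to the cluster boundary at EOF, alignment, and
same-file overlap) can be tried in userspace. The following is only a
sketch under stated assumptions: remap_range_ok(), align_up() and
is_aligned() are illustrative names that do not exist in the patch, all
values are unsigned 64-bit rather than loff_t, bs must be a power of two,
and the dedupe-only EOF check on the destination is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* True if x is a multiple of a; a must be a power of two. */
static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * Hypothetical userspace mirror of the range checks in the quoted
 * ocfs2_reflink_remap_range().  Returns true if the request would pass
 * them, false where the kernel code would bail out with -EINVAL.
 */
bool remap_range_ok(uint64_t pos_in, uint64_t pos_out, uint64_t len,
		    uint64_t isize, uint64_t bs, bool same_inode)
{
	uint64_t blen;

	/* Offsets must not wrap and the input must lie inside i_size. */
	if (pos_in + len < pos_in || pos_out + len < pos_out ||
	    pos_in + len > isize)
		return false;

	/* When linking up to EOF, extend to the block boundary. */
	if (pos_in + len == isize)
		blen = align_up(isize, bs) - pos_in;
	else
		blen = len;

	/* Both ends of both ranges must be block aligned. */
	if (!is_aligned(pos_in, bs) || !is_aligned(pos_in + blen, bs) ||
	    !is_aligned(pos_out, bs) || !is_aligned(pos_out + blen, bs))
		return false;

	/* No overlapping ranges within the same file. */
	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
		return false;

	return true;
}
```

With bs = 4096, a request for 5000 bytes from offset 0 of a 5000-byte file
passes because it ends at EOF and is rounded up to 8192, while the same
request against an 8192-byte file fails the alignment check.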


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
@ 2016-11-11  5:49     ` Eric Ren
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Ren @ 2016-11-11  5:49 UTC (permalink / raw)
  To: Darrick J. Wong, mfasheh, jlbec; +Cc: linux-fsdevel, ocfs2-devel

Hi,

A few issues obvious to me:

On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> Connect the new VFS clone_range, copy_range, and dedupe_range features
> to the existing reflink capability of ocfs2.  Compared to the existing
> ocfs2 reflink ioctl, we have to do things a little differently to support
> the VFS semantics (we can clone subranges of a file but we don't clone
> xattrs), but the VFS ioctls are more broadly supported.

How can I test the new ocfs2 reflink (with this patch) manually? What commands should I
use to do xxx_range things?

>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/ocfs2/file.c         |   62 ++++-
>   fs/ocfs2/file.h         |    3
>   fs/ocfs2/refcounttree.c |  619 +++++++++++++++++++++++++++++++++++++++++++++++
>   fs/ocfs2/refcounttree.h |    7 +
>   4 files changed, 688 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 000c234..d5a022d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
>   	*done = ret;
>   }
>   
> -static int ocfs2_remove_inode_range(struct inode *inode,
> -				    struct buffer_head *di_bh, u64 byte_start,
> -				    u64 byte_len)
> +int ocfs2_remove_inode_range(struct inode *inode,
> +			     struct buffer_head *di_bh, u64 byte_start,
> +			     u64 byte_len)
>   {
>   	int ret = 0, flags = 0, done = 0, i;
>   	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
> @@ -2440,6 +2440,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
>   	return offset;
>   }
>   
> +static ssize_t ocfs2_file_copy_range(struct file *file_in,
> +				     loff_t pos_in,
> +				     struct file *file_out,
> +				     loff_t pos_out,
> +				     size_t len,
> +				     unsigned int flags)
> +{
> +	int error;
> +
> +	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> +					  len, false);
> +	if (error)
> +		return error;
> +	return len;
> +}
> +
> +static int ocfs2_file_clone_range(struct file *file_in,
> +				  loff_t pos_in,
> +				  struct file *file_out,
> +				  loff_t pos_out,
> +				  u64 len)
> +{
> +	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> +					 len, false);
> +}
> +
> +#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
> +static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
> +				       u64 loff,
> +				       u64 len,
> +				       struct file *dst_file,
> +				       u64 dst_loff)
> +{
> +	int error;
> +
> +	/*
> +	 * Limit the total length we will dedupe for each operation.
> +	 * This is intended to bound the total time spent in this
> +	 * ioctl to something sane.
> +	 */
> +	if (len > OCFS2_MAX_DEDUPE_LEN)
> +		len = OCFS2_MAX_DEDUPE_LEN;
> +
> +	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
> +					  len, true);
> +	if (error)
> +		return error;
> +	return len;
> +}
> +
>   const struct inode_operations ocfs2_file_iops = {
>   	.setattr	= ocfs2_setattr,
>   	.getattr	= ocfs2_getattr,
> @@ -2479,6 +2529,9 @@ const struct file_operations ocfs2_fops = {
>   	.splice_read	= generic_file_splice_read,
>   	.splice_write	= iter_file_splice_write,
>   	.fallocate	= ocfs2_fallocate,
> +	.copy_file_range = ocfs2_file_copy_range,
> +	.clone_file_range = ocfs2_file_clone_range,
> +	.dedupe_file_range = ocfs2_file_dedupe_range,
>   };
>   
>   const struct file_operations ocfs2_dops = {
> @@ -2524,6 +2577,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
>   	.splice_read	= generic_file_splice_read,
>   	.splice_write	= iter_file_splice_write,
>   	.fallocate	= ocfs2_fallocate,
> +	.copy_file_range = ocfs2_file_copy_range,
> +	.clone_file_range = ocfs2_file_clone_range,
> +	.dedupe_file_range = ocfs2_file_dedupe_range,
>   };
>   
>   const struct file_operations ocfs2_dops_no_plocks = {
> diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
> index e8c62f2..897fd9a 100644
> --- a/fs/ocfs2/file.h
> +++ b/fs/ocfs2/file.h
> @@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
>   
>   int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
>   				   size_t count);
> +int ocfs2_remove_inode_range(struct inode *inode,
> +			     struct buffer_head *di_bh, u64 byte_start,
> +			     u64 byte_len);
>   #endif /* OCFS2_FILE_H */
> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> index d92b6c6..3e2198c 100644
> --- a/fs/ocfs2/refcounttree.c
> +++ b/fs/ocfs2/refcounttree.c
> @@ -34,6 +34,7 @@
>   #include "xattr.h"
>   #include "namei.h"
>   #include "ocfs2_trace.h"
> +#include "file.h"
>   
>   #include <linux/bio.h>
>   #include <linux/blkdev.h>
> @@ -4447,3 +4448,621 @@ int ocfs2_reflink_ioctl(struct inode *inode,
>   
>   	return error;
>   }
> +
> +/* Update destination inode size, if necessary. */
> +static int ocfs2_reflink_update_dest(struct inode *dest,
> +				     struct buffer_head *d_bh,
> +				     loff_t newlen)
> +{
> +	handle_t *handle;
> +	struct ocfs2_dinode *di = (struct ocfs2_dinode *)d_bh->b_data;
> +	int ret;
> +
> +	if (newlen <= i_size_read(dest))
> +		return 0;
> +
> +	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
> +				   OCFS2_INODE_UPDATE_CREDITS);
> +	if (IS_ERR(handle)) {
> +		ret = PTR_ERR(handle);
> +		mlog_errno(ret);
> +		return ret;
> +	}
> +
> +	ret = ocfs2_journal_access_di(handle, INODE_CACHE(dest), d_bh,
> +				      OCFS2_JOURNAL_ACCESS_WRITE);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_commit;
> +	}
> +
> +	spin_lock(&OCFS2_I(dest)->ip_lock);
> +	if (newlen > i_size_read(dest)) {
> +		i_size_write(dest, newlen);
> +		di->i_size = newlen;

This should be di->i_size = cpu_to_le64(newlen); since the on-disk field is little-endian.
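
To illustrate why the conversion matters, here is a small userspace sketch.
It assumes glibc's <endian.h>, where htole64() and le64toh() play the role
of the kernel's cpu_to_le64() and le64_to_cpu(). struct disk_inode and the
two helpers are made-up names for illustration, not ocfs2 structures.

```c
#define _DEFAULT_SOURCE		/* for htole64()/le64toh() in glibc */
#include <endian.h>
#include <stdint.h>

/* Hypothetical stand-in for an on-disk inode with a little-endian size. */
struct disk_inode {
	uint64_t i_size;	/* always stored little-endian */
};

/* Store a CPU-order size into the on-disk field. */
void disk_inode_set_size(struct disk_inode *di, uint64_t size)
{
	di->i_size = htole64(size);
}

/* Read the on-disk field back into CPU order. */
uint64_t disk_inode_get_size(const struct disk_inode *di)
{
	return le64toh(di->i_size);
}
```

On a little-endian host both helpers compile to plain loads and stores, so
a missing conversion goes unnoticed there and only produces a byte-swapped
size on big-endian machines.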

> +	}
> +	spin_unlock(&OCFS2_I(dest)->ip_lock);
> +

Should ocfs2_update_inode_fsync_trans() be called here? It looks like you introduced that
function to improve efficiency.
I just want to jog your memory about it, though I don't know the details of why it would be
needed.

Eric

> +	ocfs2_journal_dirty(handle, d_bh);
> +
> +out_commit:
> +	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
> +	return ret;
> +}
> +
> +/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
> +static int ocfs2_reflink_remap_extent(struct inode *s_inode,
> +				      struct buffer_head *s_bh,
> +				      loff_t pos_in,
> +				      struct inode *t_inode,
> +				      struct buffer_head *t_bh,
> +				      loff_t pos_out,
> +				      loff_t len,
> +				      struct ocfs2_cached_dealloc_ctxt *dealloc)
> +{
> +	struct ocfs2_extent_tree s_et;
> +	struct ocfs2_extent_tree t_et;
> +	struct ocfs2_dinode *dis;
> +	struct buffer_head *ref_root_bh = NULL;
> +	struct ocfs2_refcount_tree *ref_tree;
> +	struct ocfs2_super *osb;
> +	loff_t pstart, plen;
> +	u32 p_cluster, num_clusters, slast, spos, tpos;
> +	unsigned int ext_flags;
> +	int ret = 0;
> +
> +	osb = OCFS2_SB(s_inode->i_sb);
> +	dis = (struct ocfs2_dinode *)s_bh->b_data;
> +	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
> +	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
> +
> +	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
> +	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
> +	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
> +
> +	while (spos < slast) {
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out;
> +		}
> +
> +		/* Look up the extent. */
> +		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
> +					 &num_clusters, &ext_flags);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +
> +		num_clusters = min_t(u32, num_clusters, slast - spos);
> +
> +		/* Punch out the dest range. */
> +		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
> +		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
> +		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +
> +		if (p_cluster == 0)
> +			goto next_loop;
> +
> +		/* Lock the refcount btree... */
> +		ret = ocfs2_lock_refcount_tree(osb,
> +					       le64_to_cpu(dis->i_refcount_loc),
> +					       1, &ref_tree, &ref_root_bh);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +
> +		/* Mark s_inode's extent as refcounted. */
> +		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
> +			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
> +						      &ref_tree->rf_ci,
> +						      ref_root_bh, spos,
> +						      p_cluster, num_clusters,
> +						      dealloc, NULL);
> +			if (ret) {
> +				mlog_errno(ret);
> +				goto out_unlock_refcount;
> +			}
> +		}
> +
> +		/* Map in the new extent. */
> +		ext_flags |= OCFS2_EXT_REFCOUNTED;
> +		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
> +						  &ref_tree->rf_ci,
> +						  ref_root_bh,
> +						  tpos, p_cluster,
> +						  num_clusters,
> +						  ext_flags,
> +						  dealloc);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out_unlock_refcount;
> +		}
> +
> +		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> +		brelse(ref_root_bh);
> +next_loop:
> +		spos += num_clusters;
> +		tpos += num_clusters;
> +	}
> +
> +out:
> +	return ret;
> +out_unlock_refcount:
> +	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> +	brelse(ref_root_bh);
> +	return ret;
> +}
> +
> +/* Set up refcount tree and remap s_inode to t_inode. */
> +static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
> +				      struct buffer_head *s_bh,
> +				      loff_t pos_in,
> +				      struct inode *t_inode,
> +				      struct buffer_head *t_bh,
> +				      loff_t pos_out,
> +				      loff_t len)
> +{
> +	struct ocfs2_cached_dealloc_ctxt dealloc;
> +	struct ocfs2_super *osb;
> +	struct ocfs2_dinode *dis;
> +	struct ocfs2_dinode *dit;
> +	int ret;
> +
> +	osb = OCFS2_SB(s_inode->i_sb);
> +	dis = (struct ocfs2_dinode *)s_bh->b_data;
> +	dit = (struct ocfs2_dinode *)t_bh->b_data;
> +	ocfs2_init_dealloc_ctxt(&dealloc);
> +
> +	/*
> +	 * If both inodes belong to two different refcount groups then
> +	 * forget it because we don't know how (or want) to go merging
> +	 * refcount trees.
> +	 */
> +	ret = -EOPNOTSUPP;
> +	if (ocfs2_is_refcount_inode(s_inode) &&
> +	    ocfs2_is_refcount_inode(t_inode) &&
> +	    le64_to_cpu(dis->i_refcount_loc) !=
> +	    le64_to_cpu(dit->i_refcount_loc))
> +		goto out;
> +
> +	/* Neither inode has a refcount tree.  Add one to s_inode. */
> +	if (!ocfs2_is_refcount_inode(s_inode) &&
> +	    !ocfs2_is_refcount_inode(t_inode)) {
> +		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +	}
> +
> +	/* Ensure that both inodes end up with the same refcount tree. */
> +	if (!ocfs2_is_refcount_inode(s_inode)) {
> +		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
> +					      le64_to_cpu(dit->i_refcount_loc));
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +	}
> +	if (!ocfs2_is_refcount_inode(t_inode)) {
> +		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
> +					      le64_to_cpu(dis->i_refcount_loc));
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +	}
> +
> +	/*
> +	 * If we're reflinking the entire file and the source is inline
> +	 * data, just copy the contents.
> +	 */
> +	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
> +	    i_size_read(t_inode) <= len &&
> +	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
> +		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
> +		if (ret)
> +			mlog_errno(ret);
> +		goto out;
> +	}
> +
> +	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
> +					 pos_out, len, &dealloc);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out;
> +	}
> +
> +out:
> +	if (ocfs2_dealloc_has_cluster(&dealloc)) {
> +		ocfs2_schedule_truncate_log_flush(osb, 1);
> +		ocfs2_run_deallocs(osb, &dealloc);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Lock an inode and grab a bh pointing to the inode. */
> +static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
> +				     struct buffer_head **bh1,
> +				     struct inode *t_inode,
> +				     struct buffer_head **bh2)
> +{
> +	struct inode *inode1;
> +	struct inode *inode2;
> +	struct ocfs2_inode_info *oi1;
> +	struct ocfs2_inode_info *oi2;
> +	bool same_inode = (s_inode == t_inode);
> +	int status;
> +
> +	/* First grab the VFS and rw locks. */
> +	inode1 = s_inode;
> +	inode2 = t_inode;
> +	if (inode1->i_ino > inode2->i_ino)
> +		swap(inode1, inode2);
> +
> +	inode_lock(inode1);
> +	status = ocfs2_rw_lock(inode1, 1);
> +	if (status) {
> +		mlog_errno(status);
> +		goto out_i1;
> +	}
> +	if (!same_inode) {
> +		inode_lock_nested(inode2, I_MUTEX_CHILD);
> +		status = ocfs2_rw_lock(inode2, 1);
> +		if (status) {
> +			mlog_errno(status);
> +			goto out_i2;
> +		}
> +	}
> +
> +	/* Now go for the cluster locks */
> +	oi1 = OCFS2_I(inode1);
> +	oi2 = OCFS2_I(inode2);
> +
> +	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
> +				(unsigned long long)oi2->ip_blkno);
> +
> +	if (*bh1)
> +		*bh1 = NULL;
> +	if (*bh2)
> +		*bh2 = NULL;
> +
> +	/* We always want to lock the one with the lower lockid first. */
> +	if (oi1->ip_blkno > oi2->ip_blkno)
> +		mlog_errno(-ENOLCK);
> +
> +	/* lock id1 */
> +	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
> +	if (status < 0) {
> +		if (status != -ENOENT)
> +			mlog_errno(status);
> +		goto out_rw2;
> +	}
> +
> +	/* lock id2 */
> +	if (!same_inode) {
> +		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
> +						 OI_LS_REFLINK_TARGET);
> +		if (status < 0) {
> +			if (status != -ENOENT)
> +				mlog_errno(status);
> +			goto out_cl1;
> +		}
> +	} else
> +		*bh2 = *bh1;
> +
> +	trace_ocfs2_double_lock_end(
> +			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
> +			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
> +
> +	return 0;
> +
> +out_cl1:
> +	ocfs2_inode_unlock(inode1, 1);
> +	brelse(*bh1);
> +	*bh1 = NULL;
> +out_rw2:
> +	ocfs2_rw_unlock(inode2, 1);
> +out_i2:
> +	inode_unlock(inode2);
> +	ocfs2_rw_unlock(inode1, 1);
> +out_i1:
> +	inode_unlock(inode1);
> +	return status;
> +}
> +
> +/* Unlock both inodes and release buffers. */
> +static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
> +					struct buffer_head *s_bh,
> +					struct inode *t_inode,
> +					struct buffer_head *t_bh)
> +{
> +	ocfs2_inode_unlock(s_inode, 1);
> +	ocfs2_rw_unlock(s_inode, 1);
> +	inode_unlock(s_inode);
> +	brelse(s_bh);
> +
> +	if (s_inode == t_inode)
> +		return;
> +
> +	ocfs2_inode_unlock(t_inode, 1);
> +	ocfs2_rw_unlock(t_inode, 1);
> +	inode_unlock(t_inode);
> +	brelse(t_bh);
> +}
> +
> +/*
> + * Read a page's worth of file data into the page cache.  Return the page
> + * locked.
> + */
> +static struct page *ocfs2_reflink_get_page(struct inode *inode,
> +					   loff_t offset)
> +{
> +	struct address_space *mapping;
> +	struct page *page;
> +	pgoff_t n;
> +
> +	n = offset >> PAGE_SHIFT;
> +	mapping = inode->i_mapping;
> +	page = read_mapping_page(mapping, n, NULL);
> +	if (IS_ERR(page))
> +		return page;
> +	if (!PageUptodate(page)) {
> +		put_page(page);
> +		return ERR_PTR(-EIO);
> +	}
> +	lock_page(page);
> +	return page;
> +}
> +
> +/*
> + * Compare extents of two files to see if they are the same.
> + */
> +static int ocfs2_reflink_compare_extents(struct inode *src,
> +					 loff_t srcoff,
> +					 struct inode *dest,
> +					 loff_t destoff,
> +					 loff_t len,
> +					 bool *is_same)
> +{
> +	loff_t src_poff;
> +	loff_t dest_poff;
> +	void *src_addr;
> +	void *dest_addr;
> +	struct page *src_page;
> +	struct page *dest_page;
> +	loff_t cmp_len;
> +	bool same;
> +	int error;
> +
> +	error = -EINVAL;
> +	same = true;
> +	while (len) {
> +		src_poff = srcoff & (PAGE_SIZE - 1);
> +		dest_poff = destoff & (PAGE_SIZE - 1);
> +		cmp_len = min(PAGE_SIZE - src_poff,
> +			      PAGE_SIZE - dest_poff);
> +		cmp_len = min(cmp_len, len);
> +		if (cmp_len <= 0) {
> +			mlog_errno(-EUCLEAN);
> +			goto out_error;
> +		}
> +
> +		src_page = ocfs2_reflink_get_page(src, srcoff);
> +		if (IS_ERR(src_page)) {
> +			error = PTR_ERR(src_page);
> +			goto out_error;
> +		}
> +		dest_page = ocfs2_reflink_get_page(dest, destoff);
> +		if (IS_ERR(dest_page)) {
> +			error = PTR_ERR(dest_page);
> +			unlock_page(src_page);
> +			put_page(src_page);
> +			goto out_error;
> +		}
> +		src_addr = kmap_atomic(src_page);
> +		dest_addr = kmap_atomic(dest_page);
> +
> +		flush_dcache_page(src_page);
> +		flush_dcache_page(dest_page);
> +
> +		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> +			same = false;
> +
> +		kunmap_atomic(dest_addr);
> +		kunmap_atomic(src_addr);
> +		unlock_page(dest_page);
> +		unlock_page(src_page);
> +		put_page(dest_page);
> +		put_page(src_page);
> +
> +		if (!same)
> +			break;
> +
> +		srcoff += cmp_len;
> +		destoff += cmp_len;
> +		len -= cmp_len;
> +	}
> +
> +	*is_same = same;
> +	return 0;
> +
> +out_error:
> +	return error;
> +}
> +
> +/* Link a range of blocks from one file to another. */
> +int ocfs2_reflink_remap_range(struct file *file_in,
> +			      loff_t pos_in,
> +			      struct file *file_out,
> +			      loff_t pos_out,
> +			      u64 len,
> +			      bool is_dedupe)
> +{
> +	struct inode *inode_in = file_inode(file_in);
> +	struct inode *inode_out = file_inode(file_out);
> +	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
> +	struct buffer_head *in_bh = NULL, *out_bh = NULL;
> +	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
> +	bool same_inode = (inode_in == inode_out);
> +	bool is_same = false;
> +	loff_t isize;
> +	ssize_t ret;
> +	loff_t blen;
> +
> +	if (!ocfs2_refcount_tree(osb))
> +		return -EOPNOTSUPP;
> +	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> +		return -EROFS;
> +
> +	/* Lock both files against IO */
> +	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
> +	if (ret)
> +		return ret;
> +
> +	ret = -EINVAL;
> +	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
> +	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
> +		goto out_unlock;
> +
> +	/* Don't touch certain kinds of inodes */
> +	ret = -EPERM;
> +	if (IS_IMMUTABLE(inode_out))
> +		goto out_unlock;
> +
> +	ret = -ETXTBSY;
> +	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> +		goto out_unlock;
> +
> +	/* Don't reflink dirs, pipes, sockets... */
> +	ret = -EISDIR;
> +	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> +		goto out_unlock;
> +	ret = -EINVAL;
> +	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
> +		goto out_unlock;
> +	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +		goto out_unlock;
> +
> +	/* Are we going all the way to the end? */
> +	isize = i_size_read(inode_in);
> +	if (isize == 0) {
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +
> +	if (len == 0)
> +		len = isize - pos_in;
> +
> +	/* Ensure offsets don't wrap and the input is inside i_size */
> +	if (pos_in + len < pos_in || pos_out + len < pos_out ||
> +	    pos_in + len > isize)
> +		goto out_unlock;
> +
> +	/* Don't allow dedupe past EOF in the dest file */
> +	if (is_dedupe) {
> +		loff_t	disize;
> +
> +		disize = i_size_read(inode_out);
> +		if (pos_out >= disize || pos_out + len > disize)
> +			goto out_unlock;
> +	}
> +
> +	/* If we're linking to EOF, continue to the block boundary. */
> +	if (pos_in + len == isize)
> +		blen = ALIGN(isize, bs) - pos_in;
> +	else
> +		blen = len;
> +
> +	/* Only reflink if we're aligned to block boundaries */
> +	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
> +	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
> +		goto out_unlock;
> +
> +	/* Don't allow overlapped reflink within the same file */
> +	if (same_inode) {
> +		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
> +			goto out_unlock;
> +	}
> +
> +	/* Wait for the completion of any pending IOs on both files */
> +	inode_dio_wait(inode_in);
> +	if (!same_inode)
> +		inode_dio_wait(inode_out);
> +
> +	ret = filemap_write_and_wait_range(inode_in->i_mapping,
> +			pos_in, pos_in + len - 1);
> +	if (ret)
> +		goto out_unlock;
> +
> +	ret = filemap_write_and_wait_range(inode_out->i_mapping,
> +			pos_out, pos_out + len - 1);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/*
> +	 * Check that the extents are the same.
> +	 */
> +	if (is_dedupe) {
> +		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
> +						    inode_out, pos_out,
> +						    len, &is_same);
> +		if (ret)
> +			goto out_unlock;
> +		if (!is_same) {
> +			ret = -EBADE;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/* Lock out changes to the allocation maps */
> +	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> +	if (!same_inode)
> +		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
> +				  SINGLE_DEPTH_NESTING);
> +
> +	/*
> +	 * Invalidate the page cache so that we can clear any CoW mappings
> +	 * in the destination file.
> +	 */
> +	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> +				   PAGE_ALIGN(pos_out + len) - 1);
> +
> +	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
> +					 out_bh, pos_out, len);
> +
> +	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> +	if (!same_inode)
> +		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	/*
> +	 * Empty the extent map so that we may get the right extent
> +	 * record from the disk.
> +	 */
> +	ocfs2_extent_map_trunc(inode_in, 0);
> +	ocfs2_extent_map_trunc(inode_out, 0);
> +
> +	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
> +	if (ret) {
> +		mlog_errno(ret);
> +		goto out_unlock;
> +	}
> +
> +	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> +	return 0;
> +
> +out_unlock:
> +	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> +	return ret;
> +}
> diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
> index 553edfb..c023e88 100644
> --- a/fs/ocfs2/refcounttree.h
> +++ b/fs/ocfs2/refcounttree.h
> @@ -117,4 +117,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
>   			const char __user *oldname,
>   			const char __user *newname,
>   			bool preserve);
> +int ocfs2_reflink_remap_range(struct file *file_in,
> +			      loff_t pos_in,
> +			      struct file *file_out,
> +			      loff_t pos_out,
> +			      u64 len,
> +			      bool is_dedupe);
> +
>   #endif /* OCFS2_REFCOUNTTREE_H */
>
>
>
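
The chunking in ocfs2_reflink_compare_extents() quoted above is easy to get
wrong when the two offsets sit at different positions within a page, so
here is a minimal userspace sketch of just that arithmetic. next_cmp_len()
and count_chunks() are hypothetical names, page_size is passed in rather
than taken from PAGE_SIZE and is assumed to be a power of two, and no page
cache access is modelled.

```c
#include <stdint.h>

/*
 * Length of the next comparison step: the largest run that stays inside
 * one page of the source, one page of the destination, and the remaining
 * length.
 */
uint64_t next_cmp_len(uint64_t srcoff, uint64_t destoff, uint64_t len,
		      uint64_t page_size)
{
	uint64_t src_left = page_size - (srcoff & (page_size - 1));
	uint64_t dest_left = page_size - (destoff & (page_size - 1));
	uint64_t n = src_left < dest_left ? src_left : dest_left;

	return n < len ? n : len;
}

/* Number of memcmp() steps needed to cover the whole range. */
unsigned int count_chunks(uint64_t srcoff, uint64_t destoff, uint64_t len,
			  uint64_t page_size)
{
	unsigned int chunks = 0;

	while (len) {
		uint64_t n = next_cmp_len(srcoff, destoff, len, page_size);

		srcoff += n;
		destoff += n;
		len -= n;
		chunks++;
	}
	return chunks;
}
```

When both offsets are page aligned, each step covers a full page. When
they are misaligned relative to each other, the steps alternate between
short and long runs, which is why the loop recomputes both in-page offsets
on every iteration.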

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  2016-11-11  5:49     ` Eric Ren
@ 2016-11-11  6:20       ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-11  6:20 UTC (permalink / raw)
  To: Eric Ren; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Fri, Nov 11, 2016 at 01:49:48PM +0800, Eric Ren wrote:
> Hi,
> 
> A few issues obvious to me:
> 
> On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> >Connect the new VFS clone_range, copy_range, and dedupe_range features
> >to the existing reflink capability of ocfs2.  Compared to the existing
> >ocfs2 reflink ioctl, we have to do things a little differently to support
> >the VFS semantics (we can clone subranges of a file but we don't clone
> >xattrs), but the VFS ioctls are more broadly supported.
> 
> How can I test the new ocfs2 reflink (with this patch) manually? What
> commands should I use to do xxx_range things?

See the 'reflink', 'dedupe', and 'copy_range' commands in xfs_io.

The first two were added in xfsprogs 4.3, and copy_range in 4.7.
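
For a manual test that does not depend on the installed xfsprogs version,
the clone ioctl can also be driven from a small C helper.  This is only a
sketch: clone_range() is a hypothetical name, while FICLONERANGE and
struct file_clone_range come from <linux/fs.h> (headers from kernel 4.5 or
later are assumed).

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Hypothetical helper: ask the filesystem to share the byte range
 * [src_off, src_off + len) of src_fd into dst_fd at dst_off.
 * Returns 0 on success or a negative errno value on failure.
 */
static int clone_range(int src_fd, uint64_t src_off,
		       int dst_fd, uint64_t dst_off, uint64_t len)
{
	struct file_clone_range fcr = {
		.src_fd      = src_fd,
		.src_offset  = src_off,
		.src_length  = len,	/* 0 means "through the end of the source" */
		.dest_offset = dst_off,
	};

	/* The ioctl is issued on the destination file descriptor. */
	if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
		return -errno;
	return 0;
}
```

Open the source read-only and the destination for writing on the same
ocfs2 mount, then call clone_range(src_fd, 0, dst_fd, 0, 0).  With this
patch, a volume without the refcount feature should return -EOPNOTSUPP.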

--D

> 
> >
> >Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >---
> >  fs/ocfs2/file.c         |   62 ++++-
> >  fs/ocfs2/file.h         |    3
> >  fs/ocfs2/refcounttree.c |  619 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/ocfs2/refcounttree.h |    7 +
> >  4 files changed, 688 insertions(+), 3 deletions(-)
> >
> >
> >diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> >index 000c234..d5a022d 100644
> >--- a/fs/ocfs2/file.c
> >+++ b/fs/ocfs2/file.c
> >@@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
> >  	*done = ret;
> >  }
> >-static int ocfs2_remove_inode_range(struct inode *inode,
> >-				    struct buffer_head *di_bh, u64 byte_start,
> >-				    u64 byte_len)
> >+int ocfs2_remove_inode_range(struct inode *inode,
> >+			     struct buffer_head *di_bh, u64 byte_start,
> >+			     u64 byte_len)
> >  {
> >  	int ret = 0, flags = 0, done = 0, i;
> >  	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
> >@@ -2440,6 +2440,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
> >  	return offset;
> >  }
> >+static ssize_t ocfs2_file_copy_range(struct file *file_in,
> >+				     loff_t pos_in,
> >+				     struct file *file_out,
> >+				     loff_t pos_out,
> >+				     size_t len,
> >+				     unsigned int flags)
> >+{
> >+	int error;
> >+
> >+	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> >+					  len, false);
> >+	if (error)
> >+		return error;
> >+	return len;
> >+}
> >+
> >+static int ocfs2_file_clone_range(struct file *file_in,
> >+				  loff_t pos_in,
> >+				  struct file *file_out,
> >+				  loff_t pos_out,
> >+				  u64 len)
> >+{
> >+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> >+					 len, false);
> >+}
> >+
> >+#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
> >+static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
> >+				       u64 loff,
> >+				       u64 len,
> >+				       struct file *dst_file,
> >+				       u64 dst_loff)
> >+{
> >+	int error;
> >+
> >+	/*
> >+	 * Limit the total length we will dedupe for each operation.
> >+	 * This is intended to bound the total time spent in this
> >+	 * ioctl to something sane.
> >+	 */
> >+	if (len > OCFS2_MAX_DEDUPE_LEN)
> >+		len = OCFS2_MAX_DEDUPE_LEN;
> >+
> >+	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
> >+					  len, true);
> >+	if (error)
> >+		return error;
> >+	return len;
> >+}
> >+
> >  const struct inode_operations ocfs2_file_iops = {
> >  	.setattr	= ocfs2_setattr,
> >  	.getattr	= ocfs2_getattr,
> >@@ -2479,6 +2529,9 @@ const struct file_operations ocfs2_fops = {
> >  	.splice_read	= generic_file_splice_read,
> >  	.splice_write	= iter_file_splice_write,
> >  	.fallocate	= ocfs2_fallocate,
> >+	.copy_file_range = ocfs2_file_copy_range,
> >+	.clone_file_range = ocfs2_file_clone_range,
> >+	.dedupe_file_range = ocfs2_file_dedupe_range,
> >  };
> >  const struct file_operations ocfs2_dops = {
> >@@ -2524,6 +2577,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
> >  	.splice_read	= generic_file_splice_read,
> >  	.splice_write	= iter_file_splice_write,
> >  	.fallocate	= ocfs2_fallocate,
> >+	.copy_file_range = ocfs2_file_copy_range,
> >+	.clone_file_range = ocfs2_file_clone_range,
> >+	.dedupe_file_range = ocfs2_file_dedupe_range,
> >  };
> >  const struct file_operations ocfs2_dops_no_plocks = {
> >diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
> >index e8c62f2..897fd9a 100644
> >--- a/fs/ocfs2/file.h
> >+++ b/fs/ocfs2/file.h
> >@@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
> >  int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
> >  				   size_t count);
> >+int ocfs2_remove_inode_range(struct inode *inode,
> >+			     struct buffer_head *di_bh, u64 byte_start,
> >+			     u64 byte_len);
> >  #endif /* OCFS2_FILE_H */
> >diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> >index d92b6c6..3e2198c 100644
> >--- a/fs/ocfs2/refcounttree.c
> >+++ b/fs/ocfs2/refcounttree.c
> >@@ -34,6 +34,7 @@
> >  #include "xattr.h"
> >  #include "namei.h"
> >  #include "ocfs2_trace.h"
> >+#include "file.h"
> >  #include <linux/bio.h>
> >  #include <linux/blkdev.h>
> >@@ -4447,3 +4448,621 @@ int ocfs2_reflink_ioctl(struct inode *inode,
> >  	return error;
> >  }
> >+
> >+/* Update destination inode size, if necessary. */
> >+static int ocfs2_reflink_update_dest(struct inode *dest,
> >+				     struct buffer_head *d_bh,
> >+				     loff_t newlen)
> >+{
> >+	handle_t *handle;
> >+	struct ocfs2_dinode *di = (struct ocfs2_dinode *)d_bh->b_data;
> >+	int ret;
> >+
> >+	if (newlen <= i_size_read(dest))
> >+		return 0;
> >+
> >+	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
> >+				   OCFS2_INODE_UPDATE_CREDITS);
> >+	if (IS_ERR(handle)) {
> >+		ret = PTR_ERR(handle);
> >+		mlog_errno(ret);
> >+		return ret;
> >+	}
> >+
> >+	ret = ocfs2_journal_access_di(handle, INODE_CACHE(dest), d_bh,
> >+				      OCFS2_JOURNAL_ACCESS_WRITE);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out_commit;
> >+	}
> >+
> >+	spin_lock(&OCFS2_I(dest)->ip_lock);
> >+	if (newlen > i_size_read(dest)) {
> >+		i_size_write(dest, newlen);
> >+		di->i_size = newlen;
> 
> di->i_size = cpu_to_le64(newlen);
> 
> >+	}
> >+	spin_unlock(&OCFS2_I(dest)->ip_lock);
> >+
> 
> Add ocfs2_update_inode_fsync_trans() here? It looks like you introduced
> this function to improve efficiency.
> I just want to jog your memory about this, though I don't know the
> details of why it is needed.
> 
> Eric
> 
> >+	ocfs2_journal_dirty(handle, d_bh);
> >+
> >+out_commit:
> >+	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
> >+	return ret;
> >+}
> >+
> >+/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
> >+static int ocfs2_reflink_remap_extent(struct inode *s_inode,
> >+				      struct buffer_head *s_bh,
> >+				      loff_t pos_in,
> >+				      struct inode *t_inode,
> >+				      struct buffer_head *t_bh,
> >+				      loff_t pos_out,
> >+				      loff_t len,
> >+				      struct ocfs2_cached_dealloc_ctxt *dealloc)
> >+{
> >+	struct ocfs2_extent_tree s_et;
> >+	struct ocfs2_extent_tree t_et;
> >+	struct ocfs2_dinode *dis;
> >+	struct buffer_head *ref_root_bh = NULL;
> >+	struct ocfs2_refcount_tree *ref_tree;
> >+	struct ocfs2_super *osb;
> >+	loff_t pstart, plen;
> >+	u32 p_cluster, num_clusters, slast, spos, tpos;
> >+	unsigned int ext_flags;
> >+	int ret = 0;
> >+
> >+	osb = OCFS2_SB(s_inode->i_sb);
> >+	dis = (struct ocfs2_dinode *)s_bh->b_data;
> >+	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
> >+	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
> >+
> >+	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
> >+	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
> >+	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
> >+
> >+	while (spos < slast) {
> >+		if (fatal_signal_pending(current)) {
> >+			ret = -EINTR;
> >+			goto out;
> >+		}
> >+
> >+		/* Look up the extent. */
> >+		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
> >+					 &num_clusters, &ext_flags);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+
> >+		num_clusters = min_t(u32, num_clusters, slast - spos);
> >+
> >+		/* Punch out the dest range. */
> >+		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
> >+		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
> >+		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+
> >+		if (p_cluster == 0)
> >+			goto next_loop;
> >+
> >+		/* Lock the refcount btree... */
> >+		ret = ocfs2_lock_refcount_tree(osb,
> >+					       le64_to_cpu(dis->i_refcount_loc),
> >+					       1, &ref_tree, &ref_root_bh);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+
> >+		/* Mark s_inode's extent as refcounted. */
> >+		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
> >+			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
> >+						      &ref_tree->rf_ci,
> >+						      ref_root_bh, spos,
> >+						      p_cluster, num_clusters,
> >+						      dealloc, NULL);
> >+			if (ret) {
> >+				mlog_errno(ret);
> >+				goto out_unlock_refcount;
> >+			}
> >+		}
> >+
> >+		/* Map in the new extent. */
> >+		ext_flags |= OCFS2_EXT_REFCOUNTED;
> >+		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
> >+						  &ref_tree->rf_ci,
> >+						  ref_root_bh,
> >+						  tpos, p_cluster,
> >+						  num_clusters,
> >+						  ext_flags,
> >+						  dealloc);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out_unlock_refcount;
> >+		}
> >+
> >+		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> >+		brelse(ref_root_bh);
> >+next_loop:
> >+		spos += num_clusters;
> >+		tpos += num_clusters;
> >+	}
> >+
> >+out:
> >+	return ret;
> >+out_unlock_refcount:
> >+	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> >+	brelse(ref_root_bh);
> >+	return ret;
> >+}
> >+
> >+/* Set up refcount tree and remap s_inode to t_inode. */
> >+static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
> >+				      struct buffer_head *s_bh,
> >+				      loff_t pos_in,
> >+				      struct inode *t_inode,
> >+				      struct buffer_head *t_bh,
> >+				      loff_t pos_out,
> >+				      loff_t len)
> >+{
> >+	struct ocfs2_cached_dealloc_ctxt dealloc;
> >+	struct ocfs2_super *osb;
> >+	struct ocfs2_dinode *dis;
> >+	struct ocfs2_dinode *dit;
> >+	int ret;
> >+
> >+	osb = OCFS2_SB(s_inode->i_sb);
> >+	dis = (struct ocfs2_dinode *)s_bh->b_data;
> >+	dit = (struct ocfs2_dinode *)t_bh->b_data;
> >+	ocfs2_init_dealloc_ctxt(&dealloc);
> >+
> >+	/*
> >+	 * If both inodes belong to two different refcount groups then
> >+	 * forget it because we don't know how (or want) to go merging
> >+	 * refcount trees.
> >+	 */
> >+	ret = -EOPNOTSUPP;
> >+	if (ocfs2_is_refcount_inode(s_inode) &&
> >+	    ocfs2_is_refcount_inode(t_inode) &&
> >+	    le64_to_cpu(dis->i_refcount_loc) !=
> >+	    le64_to_cpu(dit->i_refcount_loc))
> >+		goto out;
> >+
> >+	/* Neither inode has a refcount tree.  Add one to s_inode. */
> >+	if (!ocfs2_is_refcount_inode(s_inode) &&
> >+	    !ocfs2_is_refcount_inode(t_inode)) {
> >+		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+	}
> >+
> >+	/* Ensure that both inodes end up with the same refcount tree. */
> >+	if (!ocfs2_is_refcount_inode(s_inode)) {
> >+		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
> >+					      le64_to_cpu(dit->i_refcount_loc));
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+	}
> >+	if (!ocfs2_is_refcount_inode(t_inode)) {
> >+		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
> >+					      le64_to_cpu(dis->i_refcount_loc));
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+	}
> >+
> >+	/*
> >+	 * If we're reflinking the entire file and the source is inline
> >+	 * data, just copy the contents.
> >+	 */
> >+	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
> >+	    i_size_read(t_inode) <= len &&
> >+	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
> >+		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
> >+		if (ret)
> >+			mlog_errno(ret);
> >+		goto out;
> >+	}
> >+
> >+	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
> >+					 pos_out, len, &dealloc);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out;
> >+	}
> >+
> >+out:
> >+	if (ocfs2_dealloc_has_cluster(&dealloc)) {
> >+		ocfs2_schedule_truncate_log_flush(osb, 1);
> >+		ocfs2_run_deallocs(osb, &dealloc);
> >+	}
> >+
> >+	return ret;
> >+}
> >+
> >+/* Lock an inode and grab a bh pointing to the inode. */
> >+static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
> >+				     struct buffer_head **bh1,
> >+				     struct inode *t_inode,
> >+				     struct buffer_head **bh2)
> >+{
> >+	struct inode *inode1;
> >+	struct inode *inode2;
> >+	struct ocfs2_inode_info *oi1;
> >+	struct ocfs2_inode_info *oi2;
> >+	bool same_inode = (s_inode == t_inode);
> >+	int status;
> >+
> >+	/* First grab the VFS and rw locks. */
> >+	inode1 = s_inode;
> >+	inode2 = t_inode;
> >+	if (inode1->i_ino > inode2->i_ino)
> >+		swap(inode1, inode2);
> >+
> >+	inode_lock(inode1);
> >+	status = ocfs2_rw_lock(inode1, 1);
> >+	if (status) {
> >+		mlog_errno(status);
> >+		goto out_i1;
> >+	}
> >+	if (!same_inode) {
> >+		inode_lock_nested(inode2, I_MUTEX_CHILD);
> >+		status = ocfs2_rw_lock(inode2, 1);
> >+		if (status) {
> >+			mlog_errno(status);
> >+			goto out_i2;
> >+		}
> >+	}
> >+
> >+	/* Now go for the cluster locks */
> >+	oi1 = OCFS2_I(inode1);
> >+	oi2 = OCFS2_I(inode2);
> >+
> >+	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
> >+				(unsigned long long)oi2->ip_blkno);
> >+
> >+	if (*bh1)
> >+		*bh1 = NULL;
> >+	if (*bh2)
> >+		*bh2 = NULL;
> >+
> >+	/* We always want to lock the one with the lower lockid first. */
> >+	if (oi1->ip_blkno > oi2->ip_blkno)
> >+		mlog_errno(-ENOLCK);
> >+
> >+	/* lock id1 */
> >+	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
> >+	if (status < 0) {
> >+		if (status != -ENOENT)
> >+			mlog_errno(status);
> >+		goto out_rw2;
> >+	}
> >+
> >+	/* lock id2 */
> >+	if (!same_inode) {
> >+		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
> >+						 OI_LS_REFLINK_TARGET);
> >+		if (status < 0) {
> >+			if (status != -ENOENT)
> >+				mlog_errno(status);
> >+			goto out_cl1;
> >+		}
> >+	} else
> >+		*bh2 = *bh1;
> >+
> >+	trace_ocfs2_double_lock_end(
> >+			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
> >+			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
> >+
> >+	return 0;
> >+
> >+out_cl1:
> >+	ocfs2_inode_unlock(inode1, 1);
> >+	brelse(*bh1);
> >+	*bh1 = NULL;
> >+out_rw2:
> >+	ocfs2_rw_unlock(inode2, 1);
> >+out_i2:
> >+	inode_unlock(inode2);
> >+	ocfs2_rw_unlock(inode1, 1);
> >+out_i1:
> >+	inode_unlock(inode1);
> >+	return status;
> >+}
> >+
> >+/* Unlock both inodes and release buffers. */
> >+static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
> >+					struct buffer_head *s_bh,
> >+					struct inode *t_inode,
> >+					struct buffer_head *t_bh)
> >+{
> >+	ocfs2_inode_unlock(s_inode, 1);
> >+	ocfs2_rw_unlock(s_inode, 1);
> >+	inode_unlock(s_inode);
> >+	brelse(s_bh);
> >+
> >+	if (s_inode == t_inode)
> >+		return;
> >+
> >+	ocfs2_inode_unlock(t_inode, 1);
> >+	ocfs2_rw_unlock(t_inode, 1);
> >+	inode_unlock(t_inode);
> >+	brelse(t_bh);
> >+}
> >+
> >+/*
> >+ * Read a page's worth of file data into the page cache.  Return the page
> >+ * locked.
> >+ */
> >+static struct page *ocfs2_reflink_get_page(struct inode *inode,
> >+					   loff_t offset)
> >+{
> >+	struct address_space *mapping;
> >+	struct page *page;
> >+	pgoff_t n;
> >+
> >+	n = offset >> PAGE_SHIFT;
> >+	mapping = inode->i_mapping;
> >+	page = read_mapping_page(mapping, n, NULL);
> >+	if (IS_ERR(page))
> >+		return page;
> >+	if (!PageUptodate(page)) {
> >+		put_page(page);
> >+		return ERR_PTR(-EIO);
> >+	}
> >+	lock_page(page);
> >+	return page;
> >+}
> >+
> >+/*
> >+ * Compare extents of two files to see if they are the same.
> >+ */
> >+static int ocfs2_reflink_compare_extents(struct inode *src,
> >+					 loff_t srcoff,
> >+					 struct inode *dest,
> >+					 loff_t destoff,
> >+					 loff_t len,
> >+					 bool *is_same)
> >+{
> >+	loff_t src_poff;
> >+	loff_t dest_poff;
> >+	void *src_addr;
> >+	void *dest_addr;
> >+	struct page *src_page;
> >+	struct page *dest_page;
> >+	loff_t cmp_len;
> >+	bool same;
> >+	int error;
> >+
> >+	error = -EINVAL;
> >+	same = true;
> >+	while (len) {
> >+		src_poff = srcoff & (PAGE_SIZE - 1);
> >+		dest_poff = destoff & (PAGE_SIZE - 1);
> >+		cmp_len = min(PAGE_SIZE - src_poff,
> >+			      PAGE_SIZE - dest_poff);
> >+		cmp_len = min(cmp_len, len);
> >+		if (cmp_len <= 0) {
> >+			mlog_errno(-EUCLEAN);
> >+			goto out_error;
> >+		}
> >+
> >+		src_page = ocfs2_reflink_get_page(src, srcoff);
> >+		if (IS_ERR(src_page)) {
> >+			error = PTR_ERR(src_page);
> >+			goto out_error;
> >+		}
> >+		dest_page = ocfs2_reflink_get_page(dest, destoff);
> >+		if (IS_ERR(dest_page)) {
> >+			error = PTR_ERR(dest_page);
> >+			unlock_page(src_page);
> >+			put_page(src_page);
> >+			goto out_error;
> >+		}
> >+		src_addr = kmap_atomic(src_page);
> >+		dest_addr = kmap_atomic(dest_page);
> >+
> >+		flush_dcache_page(src_page);
> >+		flush_dcache_page(dest_page);
> >+
> >+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> >+			same = false;
> >+
> >+		kunmap_atomic(dest_addr);
> >+		kunmap_atomic(src_addr);
> >+		unlock_page(dest_page);
> >+		unlock_page(src_page);
> >+		put_page(dest_page);
> >+		put_page(src_page);
> >+
> >+		if (!same)
> >+			break;
> >+
> >+		srcoff += cmp_len;
> >+		destoff += cmp_len;
> >+		len -= cmp_len;
> >+	}
> >+
> >+	*is_same = same;
> >+	return 0;
> >+
> >+out_error:
> >+	return error;
> >+}
> >+
> >+/* Link a range of blocks from one file to another. */
> >+int ocfs2_reflink_remap_range(struct file *file_in,
> >+			      loff_t pos_in,
> >+			      struct file *file_out,
> >+			      loff_t pos_out,
> >+			      u64 len,
> >+			      bool is_dedupe)
> >+{
> >+	struct inode *inode_in = file_inode(file_in);
> >+	struct inode *inode_out = file_inode(file_out);
> >+	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
> >+	struct buffer_head *in_bh = NULL, *out_bh = NULL;
> >+	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
> >+	bool same_inode = (inode_in == inode_out);
> >+	bool is_same = false;
> >+	loff_t isize;
> >+	ssize_t ret;
> >+	loff_t blen;
> >+
> >+	if (!ocfs2_refcount_tree(osb))
> >+		return -EOPNOTSUPP;
> >+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> >+		return -EROFS;
> >+
> >+	/* Lock both files against IO */
> >+	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
> >+	if (ret)
> >+		return ret;
> >+
> >+	ret = -EINVAL;
> >+	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
> >+	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
> >+		goto out_unlock;
> >+
> >+	/* Don't touch certain kinds of inodes */
> >+	ret = -EPERM;
> >+	if (IS_IMMUTABLE(inode_out))
> >+		goto out_unlock;
> >+
> >+	ret = -ETXTBSY;
> >+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> >+		goto out_unlock;
> >+
> >+	/* Don't reflink dirs, pipes, sockets... */
> >+	ret = -EISDIR;
> >+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> >+		goto out_unlock;
> >+	ret = -EINVAL;
> >+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
> >+		goto out_unlock;
> >+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> >+		goto out_unlock;
> >+
> >+	/* Are we going all the way to the end? */
> >+	isize = i_size_read(inode_in);
> >+	if (isize == 0) {
> >+		ret = 0;
> >+		goto out_unlock;
> >+	}
> >+
> >+	if (len == 0)
> >+		len = isize - pos_in;
> >+
> >+	/* Ensure offsets don't wrap and the input is inside i_size */
> >+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
> >+	    pos_in + len > isize)
> >+		goto out_unlock;
> >+
> >+	/* Don't allow dedupe past EOF in the dest file */
> >+	if (is_dedupe) {
> >+		loff_t	disize;
> >+
> >+		disize = i_size_read(inode_out);
> >+		if (pos_out >= disize || pos_out + len > disize)
> >+			goto out_unlock;
> >+	}
> >+
> >+	/* If we're linking to EOF, continue to the block boundary. */
> >+	if (pos_in + len == isize)
> >+		blen = ALIGN(isize, bs) - pos_in;
> >+	else
> >+		blen = len;
> >+
> >+	/* Only reflink if we're aligned to block boundaries */
> >+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
> >+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
> >+		goto out_unlock;
> >+
> >+	/* Don't allow overlapped reflink within the same file */
> >+	if (same_inode) {
> >+		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
> >+			goto out_unlock;
> >+	}
> >+
> >+	/* Wait for the completion of any pending IOs on both files */
> >+	inode_dio_wait(inode_in);
> >+	if (!same_inode)
> >+		inode_dio_wait(inode_out);
> >+
> >+	ret = filemap_write_and_wait_range(inode_in->i_mapping,
> >+			pos_in, pos_in + len - 1);
> >+	if (ret)
> >+		goto out_unlock;
> >+
> >+	ret = filemap_write_and_wait_range(inode_out->i_mapping,
> >+			pos_out, pos_out + len - 1);
> >+	if (ret)
> >+		goto out_unlock;
> >+
> >+	/*
> >+	 * Check that the extents are the same.
> >+	 */
> >+	if (is_dedupe) {
> >+		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
> >+						    inode_out, pos_out,
> >+						    len, &is_same);
> >+		if (ret)
> >+			goto out_unlock;
> >+		if (!is_same) {
> >+			ret = -EBADE;
> >+			goto out_unlock;
> >+		}
> >+	}
> >+
> >+	/* Lock out changes to the allocation maps */
> >+	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> >+	if (!same_inode)
> >+		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
> >+				  SINGLE_DEPTH_NESTING);
> >+
> >+	/*
> >+	 * Invalidate the page cache so that we can clear any CoW mappings
> >+	 * in the destination file.
> >+	 */
> >+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> >+				   PAGE_ALIGN(pos_out + len) - 1);
> >+
> >+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
> >+					 out_bh, pos_out, len);
> >+
> >+	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> >+	if (!same_inode)
> >+		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out_unlock;
> >+	}
> >+
> >+	/*
> >+	 * Empty the extent map so that we may get the right extent
> >+	 * record from the disk.
> >+	 */
> >+	ocfs2_extent_map_trunc(inode_in, 0);
> >+	ocfs2_extent_map_trunc(inode_out, 0);
> >+
> >+	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out_unlock;
> >+	}
> >+
> >+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> >+	return 0;
> >+
> >+out_unlock:
> >+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> >+	return ret;
> >+}
> >diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
> >index 553edfb..c023e88 100644
> >--- a/fs/ocfs2/refcounttree.h
> >+++ b/fs/ocfs2/refcounttree.h
> >@@ -117,4 +117,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
> >  			const char __user *oldname,
> >  			const char __user *newname,
> >  			bool preserve);
> >+int ocfs2_reflink_remap_range(struct file *file_in,
> >+			      loff_t pos_in,
> >+			      struct file *file_out,
> >+			      loff_t pos_out,
> >+			      u64 len,
> >+			      bool is_dedupe);
> >+
> >  #endif /* OCFS2_REFCOUNTTREE_H */
> >
> >
> >_______________________________________________
> >Ocfs2-devel mailing list
> >Ocfs2-devel@oss.oracle.com
> >https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> >
> 
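
The range checks at the top of ocfs2_reflink_remap_range() are easier to
reason about in isolation.  The following is a userspace sketch of them,
not the kernel code: remap_range_ok(), align_up() and is_aligned() are
hypothetical names, unsigned 64-bit offsets stand in for loff_t, and "bs"
is the cluster size in bytes (a power of two), as in the patch.  The early
isize == 0 shortcut and the dedupe-only EOF check are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* True if x is a multiple of a; a must be a power of two. */
static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/* Return true if the requested range passes the patch's sanity checks. */
static bool remap_range_ok(uint64_t pos_in, uint64_t pos_out, uint64_t len,
			   uint64_t isize, uint64_t bs, bool same_inode)
{
	uint64_t blen;

	/* Extra guard for unsigned arithmetic: the source must start in-file. */
	if (pos_in > isize)
		return false;

	/* A zero length means "from pos_in to the end of the source file". */
	if (len == 0)
		len = isize - pos_in;

	/* Offsets must not wrap and the source range must lie inside i_size. */
	if (pos_in + len < pos_in || pos_out + len < pos_out ||
	    pos_in + len > isize)
		return false;

	/* When linking up to EOF, extend the range to the cluster boundary. */
	if (pos_in + len == isize)
		blen = align_up(isize, bs) - pos_in;
	else
		blen = len;

	/* Both ends of both ranges must be cluster aligned. */
	if (!is_aligned(pos_in, bs) || !is_aligned(pos_in + blen, bs) ||
	    !is_aligned(pos_out, bs) || !is_aligned(pos_out + blen, bs))
		return false;

	/* Within one file, the source and destination must not overlap. */
	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
		return false;

	return true;
}
```

For example, with a 4096-byte cluster a whole-file clone of a 5000-byte
file is accepted because the tail is rounded up to 8192, whereas cloning
only the first 4000 bytes of the same file is rejected as unaligned.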

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
@ 2016-11-11  6:20       ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-11  6:20 UTC (permalink / raw)
  To: Eric Ren; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Fri, Nov 11, 2016 at 01:49:48PM +0800, Eric Ren wrote:
> Hi,
> 
> A few issues obvious to me:
> 
> On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> >Connect the new VFS clone_range, copy_range, and dedupe_range features
> >to the existing reflink capability of ocfs2.  Compared to the existing
> >ocfs2 reflink ioctl, we have to do things a little differently to support
> >the VFS semantics (we can clone subranges of a file but we don't clone
> >xattrs), but the VFS ioctls are more broadly supported.
> 
> How can I test the new ocfs2 reflink (with this patch) manually? What
> commands should I use to do xxx_range things?

See the 'reflink', 'dedupe', and 'copy_range' commands in xfs_io.

The first two were added in xfsprogs 4.3, and copy_range in 4.7.

--D

> 
> >
> >Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >---
> >  fs/ocfs2/file.c         |   62 ++++-
> >  fs/ocfs2/file.h         |    3
> >  fs/ocfs2/refcounttree.c |  619 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/ocfs2/refcounttree.h |    7 +
> >  4 files changed, 688 insertions(+), 3 deletions(-)
> >
> >
> >diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> >index 000c234..d5a022d 100644
> >--- a/fs/ocfs2/file.c
> >+++ b/fs/ocfs2/file.c
> >@@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
> >  	*done = ret;
> >  }
> >-static int ocfs2_remove_inode_range(struct inode *inode,
> >-				    struct buffer_head *di_bh, u64 byte_start,
> >-				    u64 byte_len)
> >+int ocfs2_remove_inode_range(struct inode *inode,
> >+			     struct buffer_head *di_bh, u64 byte_start,
> >+			     u64 byte_len)
> >  {
> >  	int ret = 0, flags = 0, done = 0, i;
> >  	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
> >@@ -2440,6 +2440,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
> >  	return offset;
> >  }
> >+static ssize_t ocfs2_file_copy_range(struct file *file_in,
> >+				     loff_t pos_in,
> >+				     struct file *file_out,
> >+				     loff_t pos_out,
> >+				     size_t len,
> >+				     unsigned int flags)
> >+{
> >+	int error;
> >+
> >+	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> >+					  len, false);
> >+	if (error)
> >+		return error;
> >+	return len;
> >+}
> >+
> >+static int ocfs2_file_clone_range(struct file *file_in,
> >+				  loff_t pos_in,
> >+				  struct file *file_out,
> >+				  loff_t pos_out,
> >+				  u64 len)
> >+{
> >+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> >+					 len, false);
> >+}
> >+
> >+#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
> >+static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
> >+				       u64 loff,
> >+				       u64 len,
> >+				       struct file *dst_file,
> >+				       u64 dst_loff)
> >+{
> >+	int error;
> >+
> >+	/*
> >+	 * Limit the total length we will dedupe for each operation.
> >+	 * This is intended to bound the total time spent in this
> >+	 * ioctl to something sane.
> >+	 */
> >+	if (len > OCFS2_MAX_DEDUPE_LEN)
> >+		len = OCFS2_MAX_DEDUPE_LEN;
> >+
> >+	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
> >+					  len, true);
> >+	if (error)
> >+		return error;
> >+	return len;
> >+}
> >+
> >  const struct inode_operations ocfs2_file_iops = {
> >  	.setattr	= ocfs2_setattr,
> >  	.getattr	= ocfs2_getattr,
> >@@ -2479,6 +2529,9 @@ const struct file_operations ocfs2_fops = {
> >  	.splice_read	= generic_file_splice_read,
> >  	.splice_write	= iter_file_splice_write,
> >  	.fallocate	= ocfs2_fallocate,
> >+	.copy_file_range = ocfs2_file_copy_range,
> >+	.clone_file_range = ocfs2_file_clone_range,
> >+	.dedupe_file_range = ocfs2_file_dedupe_range,
> >  };
> >  const struct file_operations ocfs2_dops = {
> >@@ -2524,6 +2577,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
> >  	.splice_read	= generic_file_splice_read,
> >  	.splice_write	= iter_file_splice_write,
> >  	.fallocate	= ocfs2_fallocate,
> >+	.copy_file_range = ocfs2_file_copy_range,
> >+	.clone_file_range = ocfs2_file_clone_range,
> >+	.dedupe_file_range = ocfs2_file_dedupe_range,
> >  };
> >  const struct file_operations ocfs2_dops_no_plocks = {
> >diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
> >index e8c62f2..897fd9a 100644
> >--- a/fs/ocfs2/file.h
> >+++ b/fs/ocfs2/file.h
> >@@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
> >  int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
> >  				   size_t count);
> >+int ocfs2_remove_inode_range(struct inode *inode,
> >+			     struct buffer_head *di_bh, u64 byte_start,
> >+			     u64 byte_len);
> >  #endif /* OCFS2_FILE_H */
> >diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> >index d92b6c6..3e2198c 100644
> >--- a/fs/ocfs2/refcounttree.c
> >+++ b/fs/ocfs2/refcounttree.c
> >@@ -34,6 +34,7 @@
> >  #include "xattr.h"
> >  #include "namei.h"
> >  #include "ocfs2_trace.h"
> >+#include "file.h"
> >  #include <linux/bio.h>
> >  #include <linux/blkdev.h>
> >@@ -4447,3 +4448,621 @@ int ocfs2_reflink_ioctl(struct inode *inode,
> >  	return error;
> >  }
> >+
> >+/* Update destination inode size, if necessary. */
> >+static int ocfs2_reflink_update_dest(struct inode *dest,
> >+				     struct buffer_head *d_bh,
> >+				     loff_t newlen)
> >+{
> >+	handle_t *handle;
> >+	struct ocfs2_dinode *di = (struct ocfs2_dinode *)d_bh->b_data;
> >+	int ret;
> >+
> >+	if (newlen <= i_size_read(dest))
> >+		return 0;
> >+
> >+	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
> >+				   OCFS2_INODE_UPDATE_CREDITS);
> >+	if (IS_ERR(handle)) {
> >+		ret = PTR_ERR(handle);
> >+		mlog_errno(ret);
> >+		return ret;
> >+	}
> >+
> >+	ret = ocfs2_journal_access_di(handle, INODE_CACHE(dest), d_bh,
> >+				      OCFS2_JOURNAL_ACCESS_WRITE);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out_commit;
> >+	}
> >+
> >+	spin_lock(&OCFS2_I(dest)->ip_lock);
> >+	if (newlen > i_size_read(dest)) {
> >+		i_size_write(dest, newlen);
> >+		di->i_size = newlen;
> 
> di->i_size = cpu_to_le64(newlen);
> 
> >+	}
> >+	spin_unlock(&OCFS2_I(dest)->ip_lock);
> >+
> 
> Add ocfs2_update_inode_fsync_trans() here? It looks like you introduced
> this function to improve efficiency.
> I just want to jog your memory about this, though I don't know the
> details of why it is needed.
> 
> Eric
> 
> >+	ocfs2_journal_dirty(handle, d_bh);
> >+
> >+out_commit:
> >+	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
> >+	return ret;
> >+}
> >+
> >+/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
> >+static int ocfs2_reflink_remap_extent(struct inode *s_inode,
> >+				      struct buffer_head *s_bh,
> >+				      loff_t pos_in,
> >+				      struct inode *t_inode,
> >+				      struct buffer_head *t_bh,
> >+				      loff_t pos_out,
> >+				      loff_t len,
> >+				      struct ocfs2_cached_dealloc_ctxt *dealloc)
> >+{
> >+	struct ocfs2_extent_tree s_et;
> >+	struct ocfs2_extent_tree t_et;
> >+	struct ocfs2_dinode *dis;
> >+	struct buffer_head *ref_root_bh = NULL;
> >+	struct ocfs2_refcount_tree *ref_tree;
> >+	struct ocfs2_super *osb;
> >+	loff_t pstart, plen;
> >+	u32 p_cluster, num_clusters, slast, spos, tpos;
> >+	unsigned int ext_flags;
> >+	int ret = 0;
> >+
> >+	osb = OCFS2_SB(s_inode->i_sb);
> >+	dis = (struct ocfs2_dinode *)s_bh->b_data;
> >+	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
> >+	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
> >+
> >+	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
> >+	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
> >+	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
> >+
> >+	while (spos < slast) {
> >+		if (fatal_signal_pending(current)) {
> >+			ret = -EINTR;
> >+			goto out;
> >+		}
> >+
> >+		/* Look up the extent. */
> >+		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
> >+					 &num_clusters, &ext_flags);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+
> >+		num_clusters = min_t(u32, num_clusters, slast - spos);
> >+
> >+		/* Punch out the dest range. */
> >+		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
> >+		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
> >+		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+
> >+		if (p_cluster == 0)
> >+			goto next_loop;
> >+
> >+		/* Lock the refcount btree... */
> >+		ret = ocfs2_lock_refcount_tree(osb,
> >+					       le64_to_cpu(dis->i_refcount_loc),
> >+					       1, &ref_tree, &ref_root_bh);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+
> >+		/* Mark s_inode's extent as refcounted. */
> >+		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
> >+			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
> >+						      &ref_tree->rf_ci,
> >+						      ref_root_bh, spos,
> >+						      p_cluster, num_clusters,
> >+						      dealloc, NULL);
> >+			if (ret) {
> >+				mlog_errno(ret);
> >+				goto out_unlock_refcount;
> >+			}
> >+		}
> >+
> >+		/* Map in the new extent. */
> >+		ext_flags |= OCFS2_EXT_REFCOUNTED;
> >+		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
> >+						  &ref_tree->rf_ci,
> >+						  ref_root_bh,
> >+						  tpos, p_cluster,
> >+						  num_clusters,
> >+						  ext_flags,
> >+						  dealloc);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out_unlock_refcount;
> >+		}
> >+
> >+		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> >+		brelse(ref_root_bh);
> >+next_loop:
> >+		spos += num_clusters;
> >+		tpos += num_clusters;
> >+	}
> >+
> >+out:
> >+	return ret;
> >+out_unlock_refcount:
> >+	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
> >+	brelse(ref_root_bh);
> >+	return ret;
> >+}
> >+
> >+/* Set up refcount tree and remap s_inode to t_inode. */
> >+static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
> >+				      struct buffer_head *s_bh,
> >+				      loff_t pos_in,
> >+				      struct inode *t_inode,
> >+				      struct buffer_head *t_bh,
> >+				      loff_t pos_out,
> >+				      loff_t len)
> >+{
> >+	struct ocfs2_cached_dealloc_ctxt dealloc;
> >+	struct ocfs2_super *osb;
> >+	struct ocfs2_dinode *dis;
> >+	struct ocfs2_dinode *dit;
> >+	int ret;
> >+
> >+	osb = OCFS2_SB(s_inode->i_sb);
> >+	dis = (struct ocfs2_dinode *)s_bh->b_data;
> >+	dit = (struct ocfs2_dinode *)t_bh->b_data;
> >+	ocfs2_init_dealloc_ctxt(&dealloc);
> >+
> >+	/*
> >+	 * If both inodes belong to two different refcount groups then
> >+	 * forget it because we don't know how (or want) to go merging
> >+	 * refcount trees.
> >+	 */
> >+	ret = -EOPNOTSUPP;
> >+	if (ocfs2_is_refcount_inode(s_inode) &&
> >+	    ocfs2_is_refcount_inode(t_inode) &&
> >+	    le64_to_cpu(dis->i_refcount_loc) !=
> >+	    le64_to_cpu(dit->i_refcount_loc))
> >+		goto out;
> >+
> >+	/* Neither inode has a refcount tree.  Add one to s_inode. */
> >+	if (!ocfs2_is_refcount_inode(s_inode) &&
> >+	    !ocfs2_is_refcount_inode(t_inode)) {
> >+		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+	}
> >+
> >+	/* Ensure that both inodes end up with the same refcount tree. */
> >+	if (!ocfs2_is_refcount_inode(s_inode)) {
> >+		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
> >+					      le64_to_cpu(dit->i_refcount_loc));
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+	}
> >+	if (!ocfs2_is_refcount_inode(t_inode)) {
> >+		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
> >+					      le64_to_cpu(dis->i_refcount_loc));
> >+		if (ret) {
> >+			mlog_errno(ret);
> >+			goto out;
> >+		}
> >+	}
> >+
> >+	/*
> >+	 * If we're reflinking the entire file and the source is inline
> >+	 * data, just copy the contents.
> >+	 */
> >+	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
> >+	    i_size_read(t_inode) <= len &&
> >+	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
> >+		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
> >+		if (ret)
> >+			mlog_errno(ret);
> >+		goto out;
> >+	}
> >+
> >+	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
> >+					 pos_out, len, &dealloc);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out;
> >+	}
> >+
> >+out:
> >+	if (ocfs2_dealloc_has_cluster(&dealloc)) {
> >+		ocfs2_schedule_truncate_log_flush(osb, 1);
> >+		ocfs2_run_deallocs(osb, &dealloc);
> >+	}
> >+
> >+	return ret;
> >+}
> >+
> >+/* Lock an inode and grab a bh pointing to the inode. */
> >+static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
> >+				     struct buffer_head **bh1,
> >+				     struct inode *t_inode,
> >+				     struct buffer_head **bh2)
> >+{
> >+	struct inode *inode1;
> >+	struct inode *inode2;
> >+	struct ocfs2_inode_info *oi1;
> >+	struct ocfs2_inode_info *oi2;
> >+	bool same_inode = (s_inode == t_inode);
> >+	int status;
> >+
> >+	/* First grab the VFS and rw locks. */
> >+	inode1 = s_inode;
> >+	inode2 = t_inode;
> >+	if (inode1->i_ino > inode2->i_ino)
> >+		swap(inode1, inode2);
> >+
> >+	inode_lock(inode1);
> >+	status = ocfs2_rw_lock(inode1, 1);
> >+	if (status) {
> >+		mlog_errno(status);
> >+		goto out_i1;
> >+	}
> >+	if (!same_inode) {
> >+		inode_lock_nested(inode2, I_MUTEX_CHILD);
> >+		status = ocfs2_rw_lock(inode2, 1);
> >+		if (status) {
> >+			mlog_errno(status);
> >+			goto out_i2;
> >+		}
> >+	}
> >+
> >+	/* Now go for the cluster locks */
> >+	oi1 = OCFS2_I(inode1);
> >+	oi2 = OCFS2_I(inode2);
> >+
> >+	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
> >+				(unsigned long long)oi2->ip_blkno);
> >+
> >+	if (*bh1)
> >+		*bh1 = NULL;
> >+	if (*bh2)
> >+		*bh2 = NULL;
> >+
> >+	/* We always want to lock the one with the lower lockid first. */
> >+	if (oi1->ip_blkno > oi2->ip_blkno)
> >+		mlog_errno(-ENOLCK);
> >+
> >+	/* lock id1 */
> >+	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
> >+	if (status < 0) {
> >+		if (status != -ENOENT)
> >+			mlog_errno(status);
> >+		goto out_rw2;
> >+	}
> >+
> >+	/* lock id2 */
> >+	if (!same_inode) {
> >+		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
> >+						 OI_LS_REFLINK_TARGET);
> >+		if (status < 0) {
> >+			if (status != -ENOENT)
> >+				mlog_errno(status);
> >+			goto out_cl1;
> >+		}
> >+	} else
> >+		*bh2 = *bh1;
> >+
> >+	trace_ocfs2_double_lock_end(
> >+			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
> >+			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
> >+
> >+	return 0;
> >+
> >+out_cl1:
> >+	ocfs2_inode_unlock(inode1, 1);
> >+	brelse(*bh1);
> >+	*bh1 = NULL;
> >+out_rw2:
> >+	ocfs2_rw_unlock(inode2, 1);
> >+out_i2:
> >+	inode_unlock(inode2);
> >+	ocfs2_rw_unlock(inode1, 1);
> >+out_i1:
> >+	inode_unlock(inode1);
> >+	return status;
> >+}
> >+
> >+/* Unlock both inodes and release buffers. */
> >+static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
> >+					struct buffer_head *s_bh,
> >+					struct inode *t_inode,
> >+					struct buffer_head *t_bh)
> >+{
> >+	ocfs2_inode_unlock(s_inode, 1);
> >+	ocfs2_rw_unlock(s_inode, 1);
> >+	inode_unlock(s_inode);
> >+	brelse(s_bh);
> >+
> >+	if (s_inode == t_inode)
> >+		return;
> >+
> >+	ocfs2_inode_unlock(t_inode, 1);
> >+	ocfs2_rw_unlock(t_inode, 1);
> >+	inode_unlock(t_inode);
> >+	brelse(t_bh);
> >+}
> >+
> >+/*
> >+ * Read a page's worth of file data into the page cache.  Return the page
> >+ * locked.
> >+ */
> >+static struct page *ocfs2_reflink_get_page(struct inode *inode,
> >+					   loff_t offset)
> >+{
> >+	struct address_space *mapping;
> >+	struct page *page;
> >+	pgoff_t n;
> >+
> >+	n = offset >> PAGE_SHIFT;
> >+	mapping = inode->i_mapping;
> >+	page = read_mapping_page(mapping, n, NULL);
> >+	if (IS_ERR(page))
> >+		return page;
> >+	if (!PageUptodate(page)) {
> >+		put_page(page);
> >+		return ERR_PTR(-EIO);
> >+	}
> >+	lock_page(page);
> >+	return page;
> >+}
> >+
> >+/*
> >+ * Compare extents of two files to see if they are the same.
> >+ */
> >+static int ocfs2_reflink_compare_extents(struct inode *src,
> >+					 loff_t srcoff,
> >+					 struct inode *dest,
> >+					 loff_t destoff,
> >+					 loff_t len,
> >+					 bool *is_same)
> >+{
> >+	loff_t src_poff;
> >+	loff_t dest_poff;
> >+	void *src_addr;
> >+	void *dest_addr;
> >+	struct page *src_page;
> >+	struct page *dest_page;
> >+	loff_t cmp_len;
> >+	bool same;
> >+	int error;
> >+
> >+	error = -EINVAL;
> >+	same = true;
> >+	while (len) {
> >+		src_poff = srcoff & (PAGE_SIZE - 1);
> >+		dest_poff = destoff & (PAGE_SIZE - 1);
> >+		cmp_len = min(PAGE_SIZE - src_poff,
> >+			      PAGE_SIZE - dest_poff);
> >+		cmp_len = min(cmp_len, len);
> >+		if (cmp_len <= 0) {
> >+			mlog_errno(-EUCLEAN);
> >+			goto out_error;
> >+		}
> >+
> >+		src_page = ocfs2_reflink_get_page(src, srcoff);
> >+		if (IS_ERR(src_page)) {
> >+			error = PTR_ERR(src_page);
> >+			goto out_error;
> >+		}
> >+		dest_page = ocfs2_reflink_get_page(dest, destoff);
> >+		if (IS_ERR(dest_page)) {
> >+			error = PTR_ERR(dest_page);
> >+			unlock_page(src_page);
> >+			put_page(src_page);
> >+			goto out_error;
> >+		}
> >+		src_addr = kmap_atomic(src_page);
> >+		dest_addr = kmap_atomic(dest_page);
> >+
> >+		flush_dcache_page(src_page);
> >+		flush_dcache_page(dest_page);
> >+
> >+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> >+			same = false;
> >+
> >+		kunmap_atomic(dest_addr);
> >+		kunmap_atomic(src_addr);
> >+		unlock_page(dest_page);
> >+		unlock_page(src_page);
> >+		put_page(dest_page);
> >+		put_page(src_page);
> >+
> >+		if (!same)
> >+			break;
> >+
> >+		srcoff += cmp_len;
> >+		destoff += cmp_len;
> >+		len -= cmp_len;
> >+	}
> >+
> >+	*is_same = same;
> >+	return 0;
> >+
> >+out_error:
> >+	return error;
> >+}
> >+
> >+/* Link a range of blocks from one file to another. */
> >+int ocfs2_reflink_remap_range(struct file *file_in,
> >+			      loff_t pos_in,
> >+			      struct file *file_out,
> >+			      loff_t pos_out,
> >+			      u64 len,
> >+			      bool is_dedupe)
> >+{
> >+	struct inode *inode_in = file_inode(file_in);
> >+	struct inode *inode_out = file_inode(file_out);
> >+	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
> >+	struct buffer_head *in_bh = NULL, *out_bh = NULL;
> >+	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
> >+	bool same_inode = (inode_in == inode_out);
> >+	bool is_same = false;
> >+	loff_t isize;
> >+	ssize_t ret;
> >+	loff_t blen;
> >+
> >+	if (!ocfs2_refcount_tree(osb))
> >+		return -EOPNOTSUPP;
> >+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
> >+		return -EROFS;
> >+
> >+	/* Lock both files against IO */
> >+	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
> >+	if (ret)
> >+		return ret;
> >+
> >+	ret = -EINVAL;
> >+	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
> >+	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
> >+		goto out_unlock;
> >+
> >+	/* Don't touch certain kinds of inodes */
> >+	ret = -EPERM;
> >+	if (IS_IMMUTABLE(inode_out))
> >+		goto out_unlock;
> >+
> >+	ret = -ETXTBSY;
> >+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> >+		goto out_unlock;
> >+
> >+	/* Don't reflink dirs, pipes, sockets... */
> >+	ret = -EISDIR;
> >+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> >+		goto out_unlock;
> >+	ret = -EINVAL;
> >+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
> >+		goto out_unlock;
> >+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> >+		goto out_unlock;
> >+
> >+	/* Are we going all the way to the end? */
> >+	isize = i_size_read(inode_in);
> >+	if (isize == 0) {
> >+		ret = 0;
> >+		goto out_unlock;
> >+	}
> >+
> >+	if (len == 0)
> >+		len = isize - pos_in;
> >+
> >+	/* Ensure offsets don't wrap and the input is inside i_size */
> >+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
> >+	    pos_in + len > isize)
> >+		goto out_unlock;
> >+
> >+	/* Don't allow dedupe past EOF in the dest file */
> >+	if (is_dedupe) {
> >+		loff_t	disize;
> >+
> >+		disize = i_size_read(inode_out);
> >+		if (pos_out >= disize || pos_out + len > disize)
> >+			goto out_unlock;
> >+	}
> >+
> >+	/* If we're linking to EOF, continue to the block boundary. */
> >+	if (pos_in + len == isize)
> >+		blen = ALIGN(isize, bs) - pos_in;
> >+	else
> >+		blen = len;
> >+
> >+	/* Only reflink if we're aligned to block boundaries */
> >+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
> >+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
> >+		goto out_unlock;
> >+
> >+	/* Don't allow overlapped reflink within the same file */
> >+	if (same_inode) {
> >+		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
> >+			goto out_unlock;
> >+	}
> >+
> >+	/* Wait for the completion of any pending IOs on both files */
> >+	inode_dio_wait(inode_in);
> >+	if (!same_inode)
> >+		inode_dio_wait(inode_out);
> >+
> >+	ret = filemap_write_and_wait_range(inode_in->i_mapping,
> >+			pos_in, pos_in + len - 1);
> >+	if (ret)
> >+		goto out_unlock;
> >+
> >+	ret = filemap_write_and_wait_range(inode_out->i_mapping,
> >+			pos_out, pos_out + len - 1);
> >+	if (ret)
> >+		goto out_unlock;
> >+
> >+	/*
> >+	 * Check that the extents are the same.
> >+	 */
> >+	if (is_dedupe) {
> >+		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
> >+						    inode_out, pos_out,
> >+						    len, &is_same);
> >+		if (ret)
> >+			goto out_unlock;
> >+		if (!is_same) {
> >+			ret = -EBADE;
> >+			goto out_unlock;
> >+		}
> >+	}
> >+
> >+	/* Lock out changes to the allocation maps */
> >+	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> >+	if (!same_inode)
> >+		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
> >+				  SINGLE_DEPTH_NESTING);
> >+
> >+	/*
> >+	 * Invalidate the page cache so that we can clear any CoW mappings
> >+	 * in the destination file.
> >+	 */
> >+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> >+				   PAGE_ALIGN(pos_out + len) - 1);
> >+
> >+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
> >+					 out_bh, pos_out, len);
> >+
> >+	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
> >+	if (!same_inode)
> >+		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out_unlock;
> >+	}
> >+
> >+	/*
> >+	 * Empty the extent map so that we may get the right extent
> >+	 * record from the disk.
> >+	 */
> >+	ocfs2_extent_map_trunc(inode_in, 0);
> >+	ocfs2_extent_map_trunc(inode_out, 0);
> >+
> >+	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
> >+	if (ret) {
> >+		mlog_errno(ret);
> >+		goto out_unlock;
> >+	}
> >+
> >+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> >+	return 0;
> >+
> >+out_unlock:
> >+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
> >+	return ret;
> >+}
> >diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
> >index 553edfb..c023e88 100644
> >--- a/fs/ocfs2/refcounttree.h
> >+++ b/fs/ocfs2/refcounttree.h
> >@@ -117,4 +117,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
> >  			const char __user *oldname,
> >  			const char __user *newname,
> >  			bool preserve);
> >+int ocfs2_reflink_remap_range(struct file *file_in,
> >+			      loff_t pos_in,
> >+			      struct file *file_out,
> >+			      loff_t pos_out,
> >+			      u64 len,
> >+			      bool is_dedupe);
> >+
> >  #endif /* OCFS2_REFCOUNTTREE_H */
> >
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  2016-11-11  6:20       ` Darrick J. Wong
@ 2016-11-11  6:45         ` Eric Ren
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Ren @ 2016-11-11  6:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On 11/11/2016 02:20 PM, Darrick J. Wong wrote:
> On Fri, Nov 11, 2016 at 01:49:48PM +0800, Eric Ren wrote:
>> Hi,
>>
>> A few issues obvious to me:
>>
>> On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
>>> Connect the new VFS clone_range, copy_range, and dedupe_range features
>>> to the existing reflink capability of ocfs2.  Compared to the existing
>>> ocfs2 reflink ioctl We have to do things a little differently to support
>>> the VFS semantics (we can clone subranges of a file but we don't clone
>>> xattrs), but the VFS ioctls are more broadly supported.
>> How can I test the new ocfs2 reflink (with this patch) manually? What
>> commands should I use to do xxx_range things?
> See the 'reflink', 'dedupe', and 'copy_range' commands in xfs_io.
>
> The first two were added in xfsprogs 4.3, and copy_range in 4.7.

OK, thanks. I think you are missing the following two inline comments:

>>> +	spin_lock(&OCFS2_I(dest)->ip_lock);
>>> +	if (newlen > i_size_read(dest)) {
>>> +		i_size_write(dest, newlen);
>>> +		di->i_size = newlen;
>> di->i_size = cpu_to_le64(newlen);
>>
>>> +	}
>>> +	spin_unlock(&OCFS2_I(dest)->ip_lock);
>>> +
>> Add ocfs2_update_inode_fsync_trans() here? Looks this function was
>> introduced by you to improve efficiency.
>> Just want to awake your memory about this, though I don't know about the
>> details why it should be.
>>
>> Eric
Thanks,
Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Ocfs2-devel] [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  2016-11-11  6:45         ` Eric Ren
@ 2016-11-11  9:01           ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-11  9:01 UTC (permalink / raw)
  To: Eric Ren; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Fri, Nov 11, 2016 at 02:45:54PM +0800, Eric Ren wrote:
> On 11/11/2016 02:20 PM, Darrick J. Wong wrote:
> >On Fri, Nov 11, 2016 at 01:49:48PM +0800, Eric Ren wrote:
> >>Hi,
> >>
> >>A few issues obvious to me:
> >>
> >>On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> >>>Connect the new VFS clone_range, copy_range, and dedupe_range features
> >>>to the existing reflink capability of ocfs2.  Compared to the existing
> >>>ocfs2 reflink ioctl We have to do things a little differently to support
> >>>the VFS semantics (we can clone subranges of a file but we don't clone
> >>>xattrs), but the VFS ioctls are more broadly supported.
> >>How can I test the new ocfs2 reflink (with this patch) manually? What
> >>commands should I use to do xxx_range things?
> >See the 'reflink', 'dedupe', and 'copy_range' commands in xfs_io.
> >
> >The first two were added in xfsprogs 4.3, and copy_range in 4.7.
> 
> OK, thanks. I think you are missing the following two inline comments:
> 
> >>>+	spin_lock(&OCFS2_I(dest)->ip_lock);
> >>>+	if (newlen > i_size_read(dest)) {
> >>>+		i_size_write(dest, newlen);
> >>>+		di->i_size = newlen;
> >>di->i_size = cpu_to_le64(newlen);

Good catch!
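
For anyone following along: the on-disk dinode keeps i_size in
little-endian byte order, so a bare assignment only happens to work on
little-endian hosts.  A minimal userspace sketch of the idea, not the
kernel code; struct fake_dinode and both helpers are names made up here
as stand-ins for what cpu_to_le64() and le64_to_cpu() do in one step:

```c
#include <stdint.h>

/* Hypothetical on-disk record: i_size is always little-endian bytes. */
struct fake_dinode {
	unsigned char i_size[8];
};

/* Equivalent in spirit to di->i_size = cpu_to_le64(newlen). */
static void fake_set_size(struct fake_dinode *di, uint64_t newlen)
{
	int i;

	/* Least significant byte goes to the lowest address. */
	for (i = 0; i < 8; i++)
		di->i_size[i] = (unsigned char)(newlen >> (8 * i));
}

/* Equivalent in spirit to le64_to_cpu(di->i_size). */
static uint64_t fake_get_size(const struct fake_dinode *di)
{
	uint64_t v = 0;
	int i;

	/* Rebuild the value starting from the most significant byte. */
	for (i = 7; i >= 0; i--)
		v = (v << 8) | di->i_size[i];
	return v;
}
```

The byte layout produced by fake_set_size() is the same on any host,
which is the property the on-disk format relies on.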

> >>>+	}
> >>>+	spin_unlock(&OCFS2_I(dest)->ip_lock);
> >>>+
> >>Add ocfs2_update_inode_fsync_trans() here? Looks this function was
> >>introduced by you to improve efficiency.
> >>Just want to awake your memory about this, though I don't know about the
> >>details why it should be.

D'oh!  Yes, I did miss that.

The function in question (ocfs2_reflink_update_dest) updates the
destination inode's information.  Specifically, it updates i_size if we
reflinked blocks into the file past EOF.  Looking at it some more, I also
need to update i_blocks or the stat(2) info will be wrong, and I also
need to convert inline data to extents prior to reflinking.
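
Roughly, the bookkeeping looks like the sketch below.  It is userspace
C, not the kernel code: struct fake_inode and the helper names are
invented for illustration, and it assumes i_blocks is counted in
512-byte sectors and derived from the inode's cluster count.

```c
#include <stdint.h>

/* Hypothetical in-memory inode with only the fields discussed here. */
struct fake_inode {
	uint64_t i_size;	/* bytes */
	uint64_t i_blocks;	/* 512-byte sectors, as reported by stat(2) */
	uint32_t clusters;	/* clusters allocated to the file */
};

/* Convert a cluster count to 512-byte sectors (2^9 bytes each). */
static uint64_t clusters_to_sectors(uint32_t clusters,
				    unsigned int clustersize_bits)
{
	return (uint64_t)clusters << (clustersize_bits - 9);
}

/*
 * After remapping, refresh i_blocks and grow i_size if the reflinked
 * range ended past the old EOF.  i_size never shrinks here.
 */
static void fake_update_dest(struct fake_inode *dest, uint64_t newlen,
			     unsigned int clustersize_bits)
{
	dest->i_blocks = clusters_to_sectors(dest->clusters, clustersize_bits);
	if (newlen > dest->i_size)
		dest->i_size = newlen;
}
```

With 1 MiB clusters (clustersize_bits == 20), three clusters come out
as 3 << 11 = 6144 sectors.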

--D

> >>
> >>Eric
> Thanks,
> Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
@ 2016-11-11 14:54     ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-11 14:54 UTC (permalink / raw)
  To: mfasheh, jlbec, zren; +Cc: linux-fsdevel, ocfs2-devel

Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2.  Compared to the existing
ocfs2 reflink ioctl, we have to do things a little differently to support
the VFS semantics (we can clone subranges of a file but we don't clone
xattrs), but the VFS ioctls are more broadly supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Convert inline data files to extents files before reflinking,
and fix i_blocks so that stat(2) output is correct.  fsync the inode
correctly.
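
For reference, the clone_range path can also be poked from userspace
without xfs_io.  A minimal sketch, assuming kernel headers new enough
to provide FICLONERANGE in <linux/fs.h>; clone_range() is a helper name
made up here, and offsets and length generally need to be aligned to
the filesystem's allocation unit or the call is expected to fail:

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Ask the kernel to share len bytes at src_off in src_fd with dst_off
 * in dst_fd.  Returns 0 on success or a negative errno.  A len of 0
 * means "through the end of the source file".
 */
static int clone_range(int src_fd, uint64_t src_off,
		       int dst_fd, uint64_t dst_off, uint64_t len)
{
	struct file_clone_range fcr;

	memset(&fcr, 0, sizeof(fcr));
	fcr.src_fd = src_fd;
	fcr.src_offset = src_off;
	fcr.src_length = len;
	fcr.dest_offset = dst_off;

	/* The ioctl is issued on the destination file descriptor. */
	if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
		return -errno;
	return 0;
}
```

A caller would open the source read-only and the destination for
writing, then check the return value; -EOPNOTSUPP is a plausible result
on a filesystem without reflink support.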
---
 fs/ocfs2/file.c         |   62 ++++-
 fs/ocfs2/file.h         |    3 
 fs/ocfs2/refcounttree.c |  627 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/refcounttree.h |    7 +
 4 files changed, 696 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d261f3a..71aad0e 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
 	*done = ret;
 }
 
-static int ocfs2_remove_inode_range(struct inode *inode,
-				    struct buffer_head *di_bh, u64 byte_start,
-				    u64 byte_len)
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len)
 {
 	int ret = 0, flags = 0, done = 0, i;
 	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
@@ -2439,6 +2439,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static ssize_t ocfs2_file_copy_range(struct file *file_in,
+				     loff_t pos_in,
+				     struct file *file_out,
+				     loff_t pos_out,
+				     size_t len,
+				     unsigned int flags)
+{
+	int error;
+
+	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					  len, false);
+	if (error)
+		return error;
+	return len;
+}
+
+static int ocfs2_file_clone_range(struct file *file_in,
+				  loff_t pos_in,
+				  struct file *file_out,
+				  loff_t pos_out,
+				  u64 len)
+{
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					 len, false);
+}
+
+#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
+				       u64 loff,
+				       u64 len,
+				       struct file *dst_file,
+				       u64 dst_loff)
+{
+	int error;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > OCFS2_MAX_DEDUPE_LEN)
+		len = OCFS2_MAX_DEDUPE_LEN;
+
+	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
+					  len, true);
+	if (error)
+		return error;
+	return len;
+}
+
 const struct inode_operations ocfs2_file_iops = {
 	.setattr	= ocfs2_setattr,
 	.getattr	= ocfs2_getattr,
@@ -2478,6 +2528,9 @@ const struct file_operations ocfs2_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops = {
@@ -2523,6 +2576,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops_no_plocks = {
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index e8c62f2..897fd9a 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 
 int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 				   size_t count);
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len);
 #endif /* OCFS2_FILE_H */
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 6c98d56..be51540 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -34,6 +34,7 @@
 #include "xattr.h"
 #include "namei.h"
 #include "ocfs2_trace.h"
+#include "file.h"
 
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -4441,3 +4442,629 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 
 	return error;
 }
+
+/* Update destination inode size, if necessary. */
+static int ocfs2_reflink_update_dest(struct inode *dest,
+				     struct buffer_head *d_bh,
+				     loff_t newlen)
+{
+	handle_t *handle;
+	int ret;
+
+	dest->i_blocks = ocfs2_inode_sector_count(dest);
+
+	if (newlen <= i_size_read(dest))
+		return 0;
+
+	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
+				   OCFS2_INODE_UPDATE_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		mlog_errno(ret);
+		return ret;
+	}
+
+	/* Extend i_size if needed. */
+	spin_lock(&OCFS2_I(dest)->ip_lock);
+	if (newlen > i_size_read(dest))
+		i_size_write(dest, newlen);
+	spin_unlock(&OCFS2_I(dest)->ip_lock);
+	dest->i_ctime = dest->i_mtime = current_time(dest);
+
+	ret = ocfs2_mark_inode_dirty(handle, dest, d_bh);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_commit;
+	}
+
+out_commit:
+	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
+	return ret;
+}
+
+/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
+static int ocfs2_reflink_remap_extent(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len,
+				      struct ocfs2_cached_dealloc_ctxt *dealloc)
+{
+	struct ocfs2_extent_tree s_et;
+	struct ocfs2_extent_tree t_et;
+	struct ocfs2_dinode *dis;
+	struct buffer_head *ref_root_bh = NULL;
+	struct ocfs2_refcount_tree *ref_tree;
+	struct ocfs2_super *osb;
+	loff_t pstart, plen;
+	u32 p_cluster, num_clusters, slast, spos, tpos;
+	unsigned int ext_flags;
+	int ret = 0;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
+	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
+
+	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
+	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
+	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
+
+	while (spos < slast) {
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
+
+		/* Look up the extent. */
+		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
+					 &num_clusters, &ext_flags);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		num_clusters = min_t(u32, num_clusters, slast - spos);
+
+		/* Punch out the dest range. */
+		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
+		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
+		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (p_cluster == 0)
+			goto next_loop;
+
+		/* Lock the refcount btree... */
+		ret = ocfs2_lock_refcount_tree(osb,
+					       le64_to_cpu(dis->i_refcount_loc),
+					       1, &ref_tree, &ref_root_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		/* Mark s_inode's extent as refcounted. */
+		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
+			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
+						      &ref_tree->rf_ci,
+						      ref_root_bh, spos,
+						      p_cluster, num_clusters,
+						      dealloc, NULL);
+			if (ret) {
+				mlog_errno(ret);
+				goto out_unlock_refcount;
+			}
+		}
+
+		/* Map in the new extent. */
+		ext_flags |= OCFS2_EXT_REFCOUNTED;
+		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
+						  &ref_tree->rf_ci,
+						  ref_root_bh,
+						  tpos, p_cluster,
+						  num_clusters,
+						  ext_flags,
+						  dealloc);
+		if (ret) {
+			mlog_errno(ret);
+			goto out_unlock_refcount;
+		}
+
+		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+		brelse(ref_root_bh);
+next_loop:
+		spos += num_clusters;
+		tpos += num_clusters;
+	}
+
+out:
+	return ret;
+out_unlock_refcount:
+	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+	brelse(ref_root_bh);
+	return ret;
+}
+
+/* Set up refcount tree and remap s_inode to t_inode. */
+static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len)
+{
+	struct ocfs2_cached_dealloc_ctxt dealloc;
+	struct ocfs2_super *osb;
+	struct ocfs2_dinode *dis;
+	struct ocfs2_dinode *dit;
+	int ret;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	dit = (struct ocfs2_dinode *)t_bh->b_data;
+	ocfs2_init_dealloc_ctxt(&dealloc);
+
+	/*
+	 * If we're reflinking the entire file and the source is inline
+	 * data, just copy the contents.
+	 */
+	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
+	    i_size_read(t_inode) <= len &&
+	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
+		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
+		if (ret)
+			mlog_errno(ret);
+		goto out;
+	}
+
+	/*
+	 * If both inodes belong to two different refcount groups then
+	 * forget it because we don't know how (or want) to go merging
+	 * refcount trees.
+	 */
+	ret = -EOPNOTSUPP;
+	if (ocfs2_is_refcount_inode(s_inode) &&
+	    ocfs2_is_refcount_inode(t_inode) &&
+	    le64_to_cpu(dis->i_refcount_loc) !=
+	    le64_to_cpu(dit->i_refcount_loc))
+		goto out;
+
+	/* Neither inode has a refcount tree.  Add one to s_inode. */
+	if (!ocfs2_is_refcount_inode(s_inode) &&
+	    !ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Ensure that both inodes end up with the same refcount tree. */
+	if (!ocfs2_is_refcount_inode(s_inode)) {
+		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
+					      le64_to_cpu(dit->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+	if (!ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
+					      le64_to_cpu(dis->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Turn off inline data in the dest file. */
+	if (OCFS2_I(t_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+		ret = ocfs2_convert_inline_data_to_extents(t_inode, t_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Actually remap extents now. */
+	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
+					 pos_out, len, &dealloc);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+out:
+	if (ocfs2_dealloc_has_cluster(&dealloc)) {
+		ocfs2_schedule_truncate_log_flush(osb, 1);
+		ocfs2_run_deallocs(osb, &dealloc);
+	}
+
+	return ret;
+}
+
+/* Lock an inode and grab a bh pointing to the inode. */
+static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
+				     struct buffer_head **bh1,
+				     struct inode *t_inode,
+				     struct buffer_head **bh2)
+{
+	struct inode *inode1;
+	struct inode *inode2;
+	struct ocfs2_inode_info *oi1;
+	struct ocfs2_inode_info *oi2;
+	bool same_inode = (s_inode == t_inode);
+	int status;
+
+	/* First grab the VFS and rw locks. */
+	inode1 = s_inode;
+	inode2 = t_inode;
+	if (inode1->i_ino > inode2->i_ino)
+		swap(inode1, inode2);
+
+	inode_lock(inode1);
+	status = ocfs2_rw_lock(inode1, 1);
+	if (status) {
+		mlog_errno(status);
+		goto out_i1;
+	}
+	if (!same_inode) {
+		inode_lock_nested(inode2, I_MUTEX_CHILD);
+		status = ocfs2_rw_lock(inode2, 1);
+		if (status) {
+			mlog_errno(status);
+			goto out_i2;
+		}
+	}
+
+	/* Now go for the cluster locks */
+	oi1 = OCFS2_I(inode1);
+	oi2 = OCFS2_I(inode2);
+
+	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
+				(unsigned long long)oi2->ip_blkno);
+
+	if (*bh1)
+		*bh1 = NULL;
+	if (*bh2)
+		*bh2 = NULL;
+
+	/* We always want to lock the one with the lower lockid first. */
+	if (oi1->ip_blkno > oi2->ip_blkno)
+		mlog_errno(-ENOLCK);
+
+	/* lock id1 */
+	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
+	if (status < 0) {
+		if (status != -ENOENT)
+			mlog_errno(status);
+		goto out_rw2;
+	}
+
+	/* lock id2 */
+	if (!same_inode) {
+		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
+						 OI_LS_REFLINK_TARGET);
+		if (status < 0) {
+			if (status != -ENOENT)
+				mlog_errno(status);
+			goto out_cl1;
+		}
+	} else
+		*bh2 = *bh1;
+
+	trace_ocfs2_double_lock_end(
+			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
+			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
+
+	return 0;
+
+out_cl1:
+	ocfs2_inode_unlock(inode1, 1);
+	brelse(*bh1);
+	*bh1 = NULL;
+out_rw2:
+	ocfs2_rw_unlock(inode2, 1);
+out_i2:
+	inode_unlock(inode2);
+	ocfs2_rw_unlock(inode1, 1);
+out_i1:
+	inode_unlock(inode1);
+	return status;
+}
+
+/* Unlock both inodes and release buffers. */
+static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
+					struct buffer_head *s_bh,
+					struct inode *t_inode,
+					struct buffer_head *t_bh)
+{
+	ocfs2_inode_unlock(s_inode, 1);
+	ocfs2_rw_unlock(s_inode, 1);
+	inode_unlock(s_inode);
+	brelse(s_bh);
+
+	if (s_inode == t_inode)
+		return;
+
+	ocfs2_inode_unlock(t_inode, 1);
+	ocfs2_rw_unlock(t_inode, 1);
+	inode_unlock(t_inode);
+	brelse(t_bh);
+}
+
+/*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *ocfs2_reflink_get_page(struct inode *inode,
+					   loff_t offset)
+{
+	struct address_space *mapping;
+	struct page *page;
+	pgoff_t n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int ocfs2_reflink_compare_extents(struct inode *src,
+					 loff_t srcoff,
+					 struct inode *dest,
+					 loff_t destoff,
+					 loff_t len,
+					 bool *is_same)
+{
+	loff_t src_poff;
+	loff_t dest_poff;
+	void *src_addr;
+	void *dest_addr;
+	struct page *src_page;
+	struct page *dest_page;
+	loff_t cmp_len;
+	bool same;
+	int error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		if (cmp_len <= 0) {
+			mlog_errno(-EUCLEAN);
+			goto out_error;
+		}
+
+		src_page = ocfs2_reflink_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = ocfs2_reflink_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Link a range of blocks from one file to another. */
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
+	struct buffer_head *in_bh = NULL, *out_bh = NULL;
+	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
+	bool same_inode = (inode_in == inode_out);
+	bool is_same = false;
+	loff_t isize;
+	ssize_t ret;
+	loff_t blen;
+
+	if (!ocfs2_refcount_tree(osb))
+		return -EOPNOTSUPP;
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	/* Lock both files against IO */
+	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
+	if (ret)
+		return ret;
+
+	ret = -EINVAL;
+	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
+	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
+		goto out_unlock;
+
+	/* Don't touch certain kinds of inodes */
+	ret = -EPERM;
+	if (IS_IMMUTABLE(inode_out))
+		goto out_unlock;
+
+	ret = -ETXTBSY;
+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+		goto out_unlock;
+
+	/* Don't reflink dirs, pipes, sockets... */
+	ret = -EISDIR;
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		goto out_unlock;
+	ret = -EINVAL;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		goto out_unlock;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		goto out_unlock;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	if (len == 0)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		goto out_unlock;
+
+	/* Don't allow dedupe past EOF in the dest file */
+	if (is_dedupe) {
+		loff_t	disize;
+
+		disize = i_size_read(inode_out);
+		if (pos_out >= disize || pos_out + len > disize)
+			goto out_unlock;
+	}
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		goto out_unlock;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode) {
+		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
+			goto out_unlock;
+	}
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode_in);
+	if (!same_inode)
+		inode_dio_wait(inode_out);
+
+	ret = filemap_write_and_wait_range(inode_in->i_mapping,
+			pos_in, pos_in + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	ret = filemap_write_and_wait_range(inode_out->i_mapping,
+			pos_out, pos_out + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Check that the extents are the same.
+	 */
+	if (is_dedupe) {
+		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
+						    inode_out, pos_out,
+						    len, &is_same);
+		if (ret)
+			goto out_unlock;
+		if (!is_same) {
+			ret = -EBADE;
+			goto out_unlock;
+		}
+	}
+
+	/* Lock out changes to the allocation maps */
+	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
+				  SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Invalidate the page cache so that we can clear any CoW mappings
+	 * in the destination file.
+	 */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
+
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
+
+	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	/*
+	 * Empty the extent map so that we may get the right extent
+	 * record from the disk.
+	 */
+	ocfs2_extent_map_trunc(inode_in, 0);
+	ocfs2_extent_map_trunc(inode_out, 0);
+
+	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return 0;
+
+out_unlock:
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return ret;
+}
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 6422bbc..4af55bf 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -115,4 +115,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe);
+
 #endif /* OCFS2_REFCOUNTTREE_H */
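
The range validation near the top of ocfs2_reflink_remap_range() above can be
modelled in userspace to see which requests it accepts.  This is only a
sketch, not part of the patch: remap_range_ok(), align_up() and is_aligned()
are hypothetical names, everything is uint64_t (the kernel code mixes loff_t
and u64), bs stands in for the cluster size and must be a power of two, and
the dedupe-only EOF check on the destination is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* True if x is a multiple of a; a must be a power of two. */
static bool is_aligned(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * Hypothetical model of the wrap, EOF, alignment and overlap checks.
 * isize is the source file size, bs the cluster size in bytes.
 */
static bool remap_range_ok(uint64_t pos_in, uint64_t pos_out, uint64_t len,
			   uint64_t isize, uint64_t bs, bool same_inode)
{
	uint64_t blen;

	/* An empty source is a successful no-op in the patch. */
	if (isize == 0)
		return true;

	/* len == 0 means "through the end of the source file". */
	if (len == 0)
		len = isize - pos_in;

	/* Offsets must not wrap and the input must lie inside i_size. */
	if (pos_in + len < pos_in || pos_out + len < pos_out ||
	    pos_in + len > isize)
		return false;

	/* When linking to EOF, extend the length to the cluster boundary. */
	if (pos_in + len == isize)
		blen = align_up(isize, bs) - pos_in;
	else
		blen = len;

	if (!is_aligned(pos_in, bs) || !is_aligned(pos_in + blen, bs) ||
	    !is_aligned(pos_out, bs) || !is_aligned(pos_out + blen, bs))
		return false;

	/* No overlapping source and destination within one file. */
	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
		return false;

	return true;
}
```

With bs = 4096, a whole-file request on a 5000-byte file passes because the
length is rounded up to the cluster boundary at EOF, while a 1000-byte
request in the middle of a larger file is rejected as unaligned.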

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Ocfs2-devel] [PATCH v2 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
@ 2016-11-11 14:54     ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-11 14:54 UTC (permalink / raw)
  To: mfasheh, jlbec, zren; +Cc: linux-fsdevel, ocfs2-devel

Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2.  Compared to the existing
ocfs2 reflink ioctl, we have to do things a little differently to support
the VFS semantics (we can clone subranges of a file but we don't clone
xattrs), but the VFS ioctls are more broadly supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Convert inline data files to extents files before reflinking,
and fix i_blocks so that stat(2) output is correct.  fsync the inode
correctly.
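
For anyone who wants to poke at this from userspace, the clone path is
reached through the generic FICLONERANGE ioctl.  The following is a minimal
sketch, not part of the patch; clone_range() is a hypothetical helper name,
and it assumes kernel headers new enough to define FICLONERANGE in
<linux/fs.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

/*
 * Hypothetical helper: share len bytes at src_off in src_fd with dst_off
 * in dst_fd.  Returns 0 on success or -errno on failure.  A len of 0 asks
 * the kernel to clone through the end of the source file.
 */
static int clone_range(int src_fd, uint64_t src_off,
		       int dst_fd, uint64_t dst_off, uint64_t len)
{
	struct file_clone_range fcr = {
		.src_fd		= src_fd,
		.src_offset	= src_off,
		.src_length	= len,
		.dest_offset	= dst_off,
	};

	/* The ioctl is issued on the destination file descriptor. */
	if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
		return -errno;
	return 0;
}
```

With this patch applied, offsets that are not aligned to the cluster size
should come back as -EINVAL from the alignment check in
ocfs2_reflink_remap_range(), unless the VFS rejects the request earlier.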
---
 fs/ocfs2/file.c         |   62 ++++-
 fs/ocfs2/file.h         |    3 
 fs/ocfs2/refcounttree.c |  627 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/refcounttree.h |    7 +
 4 files changed, 696 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d261f3a..71aad0e 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1667,9 +1667,9 @@ static void ocfs2_calc_trunc_pos(struct inode *inode,
 	*done = ret;
 }
 
-static int ocfs2_remove_inode_range(struct inode *inode,
-				    struct buffer_head *di_bh, u64 byte_start,
-				    u64 byte_len)
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len)
 {
 	int ret = 0, flags = 0, done = 0, i;
 	u32 trunc_start, trunc_len, trunc_end, trunc_cpos, phys_cpos;
@@ -2439,6 +2439,56 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static ssize_t ocfs2_file_copy_range(struct file *file_in,
+				     loff_t pos_in,
+				     struct file *file_out,
+				     loff_t pos_out,
+				     size_t len,
+				     unsigned int flags)
+{
+	int error;
+
+	error = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					  len, false);
+	if (error)
+		return error;
+	return len;
+}
+
+static int ocfs2_file_clone_range(struct file *file_in,
+				  loff_t pos_in,
+				  struct file *file_out,
+				  loff_t pos_out,
+				  u64 len)
+{
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					 len, false);
+}
+
+#define OCFS2_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+static ssize_t ocfs2_file_dedupe_range(struct file *src_file,
+				       u64 loff,
+				       u64 len,
+				       struct file *dst_file,
+				       u64 dst_loff)
+{
+	int error;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > OCFS2_MAX_DEDUPE_LEN)
+		len = OCFS2_MAX_DEDUPE_LEN;
+
+	error = ocfs2_reflink_remap_range(src_file, loff, dst_file, dst_loff,
+					  len, true);
+	if (error)
+		return error;
+	return len;
+}
+
 const struct inode_operations ocfs2_file_iops = {
 	.setattr	= ocfs2_setattr,
 	.getattr	= ocfs2_getattr,
@@ -2478,6 +2528,9 @@ const struct file_operations ocfs2_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops = {
@@ -2523,6 +2576,9 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
+	.copy_file_range = ocfs2_file_copy_range,
+	.clone_file_range = ocfs2_file_clone_range,
+	.dedupe_file_range = ocfs2_file_dedupe_range,
 };
 
 const struct file_operations ocfs2_dops_no_plocks = {
diff --git a/fs/ocfs2/file.h b/fs/ocfs2/file.h
index e8c62f2..897fd9a 100644
--- a/fs/ocfs2/file.h
+++ b/fs/ocfs2/file.h
@@ -82,4 +82,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 
 int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
 				   size_t count);
+int ocfs2_remove_inode_range(struct inode *inode,
+			     struct buffer_head *di_bh, u64 byte_start,
+			     u64 byte_len);
 #endif /* OCFS2_FILE_H */
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 6c98d56..be51540 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -34,6 +34,7 @@
 #include "xattr.h"
 #include "namei.h"
 #include "ocfs2_trace.h"
+#include "file.h"
 
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -4441,3 +4442,629 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 
 	return error;
 }
+
+/* Update destination inode size, if necessary. */
+static int ocfs2_reflink_update_dest(struct inode *dest,
+				     struct buffer_head *d_bh,
+				     loff_t newlen)
+{
+	handle_t *handle;
+	int ret;
+
+	dest->i_blocks = ocfs2_inode_sector_count(dest);
+
+	if (newlen <= i_size_read(dest))
+		return 0;
+
+	handle = ocfs2_start_trans(OCFS2_SB(dest->i_sb),
+				   OCFS2_INODE_UPDATE_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		mlog_errno(ret);
+		return ret;
+	}
+
+	/* Extend i_size if needed. */
+	spin_lock(&OCFS2_I(dest)->ip_lock);
+	if (newlen > i_size_read(dest))
+		i_size_write(dest, newlen);
+	spin_unlock(&OCFS2_I(dest)->ip_lock);
+	dest->i_ctime = dest->i_mtime = current_time(dest);
+
+	ret = ocfs2_mark_inode_dirty(handle, dest, d_bh);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_commit;
+	}
+
+out_commit:
+	ocfs2_commit_trans(OCFS2_SB(dest->i_sb), handle);
+	return ret;
+}
+
+/* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
+static int ocfs2_reflink_remap_extent(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len,
+				      struct ocfs2_cached_dealloc_ctxt *dealloc)
+{
+	struct ocfs2_extent_tree s_et;
+	struct ocfs2_extent_tree t_et;
+	struct ocfs2_dinode *dis;
+	struct buffer_head *ref_root_bh = NULL;
+	struct ocfs2_refcount_tree *ref_tree;
+	struct ocfs2_super *osb;
+	loff_t pstart, plen;
+	u32 p_cluster, num_clusters, slast, spos, tpos;
+	unsigned int ext_flags;
+	int ret = 0;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	ocfs2_init_dinode_extent_tree(&s_et, INODE_CACHE(s_inode), s_bh);
+	ocfs2_init_dinode_extent_tree(&t_et, INODE_CACHE(t_inode), t_bh);
+
+	spos = ocfs2_bytes_to_clusters(s_inode->i_sb, pos_in);
+	tpos = ocfs2_bytes_to_clusters(t_inode->i_sb, pos_out);
+	slast = ocfs2_clusters_for_bytes(s_inode->i_sb, pos_in + len);
+
+	while (spos < slast) {
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
+
+		/* Look up the extent. */
+		ret = ocfs2_get_clusters(s_inode, spos, &p_cluster,
+					 &num_clusters, &ext_flags);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		num_clusters = min_t(u32, num_clusters, slast - spos);
+
+		/* Punch out the dest range. */
+		pstart = ocfs2_clusters_to_bytes(t_inode->i_sb, tpos);
+		plen = ocfs2_clusters_to_bytes(t_inode->i_sb, num_clusters);
+		ret = ocfs2_remove_inode_range(t_inode, t_bh, pstart, plen);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (p_cluster == 0)
+			goto next_loop;
+
+		/* Lock the refcount btree... */
+		ret = ocfs2_lock_refcount_tree(osb,
+					       le64_to_cpu(dis->i_refcount_loc),
+					       1, &ref_tree, &ref_root_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		/* Mark s_inode's extent as refcounted. */
+		if (!(ext_flags & OCFS2_EXT_REFCOUNTED)) {
+			ret = ocfs2_add_refcount_flag(s_inode, &s_et,
+						      &ref_tree->rf_ci,
+						      ref_root_bh, spos,
+						      p_cluster, num_clusters,
+						      dealloc, NULL);
+			if (ret) {
+				mlog_errno(ret);
+				goto out_unlock_refcount;
+			}
+		}
+
+		/* Map in the new extent. */
+		ext_flags |= OCFS2_EXT_REFCOUNTED;
+		ret = ocfs2_add_refcounted_extent(t_inode, &t_et,
+						  &ref_tree->rf_ci,
+						  ref_root_bh,
+						  tpos, p_cluster,
+						  num_clusters,
+						  ext_flags,
+						  dealloc);
+		if (ret) {
+			mlog_errno(ret);
+			goto out_unlock_refcount;
+		}
+
+		ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+		brelse(ref_root_bh);
+next_loop:
+		spos += num_clusters;
+		tpos += num_clusters;
+	}
+
+out:
+	return ret;
+out_unlock_refcount:
+	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
+	brelse(ref_root_bh);
+	return ret;
+}
+
+/* Set up refcount tree and remap s_inode to t_inode. */
+static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
+				      struct buffer_head *s_bh,
+				      loff_t pos_in,
+				      struct inode *t_inode,
+				      struct buffer_head *t_bh,
+				      loff_t pos_out,
+				      loff_t len)
+{
+	struct ocfs2_cached_dealloc_ctxt dealloc;
+	struct ocfs2_super *osb;
+	struct ocfs2_dinode *dis;
+	struct ocfs2_dinode *dit;
+	int ret;
+
+	osb = OCFS2_SB(s_inode->i_sb);
+	dis = (struct ocfs2_dinode *)s_bh->b_data;
+	dit = (struct ocfs2_dinode *)t_bh->b_data;
+	ocfs2_init_dealloc_ctxt(&dealloc);
+
+	/*
+	 * If we're reflinking the entire file and the source is inline
+	 * data, just copy the contents.
+	 */
+	if (pos_in == pos_out && pos_in == 0 && len == i_size_read(s_inode) &&
+	    i_size_read(t_inode) <= len &&
+	    (OCFS2_I(s_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)) {
+		ret = ocfs2_duplicate_inline_data(s_inode, s_bh, t_inode, t_bh);
+		if (ret)
+			mlog_errno(ret);
+		goto out;
+	}
+
+	/*
+	 * If both inodes belong to two different refcount groups then
+	 * forget it because we don't know how (or want) to go merging
+	 * refcount trees.
+	 */
+	ret = -EOPNOTSUPP;
+	if (ocfs2_is_refcount_inode(s_inode) &&
+	    ocfs2_is_refcount_inode(t_inode) &&
+	    le64_to_cpu(dis->i_refcount_loc) !=
+	    le64_to_cpu(dit->i_refcount_loc))
+		goto out;
+
+	/* Neither inode has a refcount tree.  Add one to s_inode. */
+	if (!ocfs2_is_refcount_inode(s_inode) &&
+	    !ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_create_refcount_tree(s_inode, s_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Ensure that both inodes end up with the same refcount tree. */
+	if (!ocfs2_is_refcount_inode(s_inode)) {
+		ret = ocfs2_set_refcount_tree(s_inode, s_bh,
+					      le64_to_cpu(dit->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+	if (!ocfs2_is_refcount_inode(t_inode)) {
+		ret = ocfs2_set_refcount_tree(t_inode, t_bh,
+					      le64_to_cpu(dis->i_refcount_loc));
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Turn off inline data in the dest file. */
+	if (OCFS2_I(t_inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+		ret = ocfs2_convert_inline_data_to_extents(t_inode, t_bh);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+	}
+
+	/* Actually remap extents now. */
+	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
+					 pos_out, len, &dealloc);
+	if (ret) {
+		mlog_errno(ret);
+		goto out;
+	}
+
+out:
+	if (ocfs2_dealloc_has_cluster(&dealloc)) {
+		ocfs2_schedule_truncate_log_flush(osb, 1);
+		ocfs2_run_deallocs(osb, &dealloc);
+	}
+
+	return ret;
+}
+
+/* Lock an inode and grab a bh pointing to the inode. */
+static int ocfs2_reflink_inodes_lock(struct inode *s_inode,
+				     struct buffer_head **bh1,
+				     struct inode *t_inode,
+				     struct buffer_head **bh2)
+{
+	struct inode *inode1;
+	struct inode *inode2;
+	struct ocfs2_inode_info *oi1;
+	struct ocfs2_inode_info *oi2;
+	bool same_inode = (s_inode == t_inode);
+	int status;
+
+	/* First grab the VFS and rw locks. */
+	inode1 = s_inode;
+	inode2 = t_inode;
+	if (inode1->i_ino > inode2->i_ino)
+		swap(inode1, inode2);
+
+	inode_lock(inode1);
+	status = ocfs2_rw_lock(inode1, 1);
+	if (status) {
+		mlog_errno(status);
+		goto out_i1;
+	}
+	if (!same_inode) {
+		inode_lock_nested(inode2, I_MUTEX_CHILD);
+		status = ocfs2_rw_lock(inode2, 1);
+		if (status) {
+			mlog_errno(status);
+			goto out_i2;
+		}
+	}
+
+	/* Now go for the cluster locks */
+	oi1 = OCFS2_I(inode1);
+	oi2 = OCFS2_I(inode2);
+
+	trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
+				(unsigned long long)oi2->ip_blkno);
+
+	if (*bh1)
+		*bh1 = NULL;
+	if (*bh2)
+		*bh2 = NULL;
+
+	/* We always want to lock the one with the lower lockid first. */
+	if (oi1->ip_blkno > oi2->ip_blkno)
+		mlog_errno(-ENOLCK);
+
+	/* lock id1 */
+	status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
+	if (status < 0) {
+		if (status != -ENOENT)
+			mlog_errno(status);
+		goto out_rw2;
+	}
+
+	/* lock id2 */
+	if (!same_inode) {
+		status = ocfs2_inode_lock_nested(inode2, bh2, 1,
+						 OI_LS_REFLINK_TARGET);
+		if (status < 0) {
+			if (status != -ENOENT)
+				mlog_errno(status);
+			goto out_cl1;
+		}
+	} else
+		*bh2 = *bh1;
+
+	trace_ocfs2_double_lock_end(
+			(unsigned long long)OCFS2_I(inode1)->ip_blkno,
+			(unsigned long long)OCFS2_I(inode2)->ip_blkno);
+
+	return 0;
+
+out_cl1:
+	ocfs2_inode_unlock(inode1, 1);
+	brelse(*bh1);
+	*bh1 = NULL;
+out_rw2:
+	ocfs2_rw_unlock(inode2, 1);
+out_i2:
+	inode_unlock(inode2);
+	ocfs2_rw_unlock(inode1, 1);
+out_i1:
+	inode_unlock(inode1);
+	return status;
+}
+
+/* Unlock both inodes and release buffers. */
+static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
+					struct buffer_head *s_bh,
+					struct inode *t_inode,
+					struct buffer_head *t_bh)
+{
+	ocfs2_inode_unlock(s_inode, 1);
+	ocfs2_rw_unlock(s_inode, 1);
+	inode_unlock(s_inode);
+	brelse(s_bh);
+
+	if (s_inode == t_inode)
+		return;
+
+	ocfs2_inode_unlock(t_inode, 1);
+	ocfs2_rw_unlock(t_inode, 1);
+	inode_unlock(t_inode);
+	brelse(t_bh);
+}
+
+/*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *ocfs2_reflink_get_page(struct inode *inode,
+					   loff_t offset)
+{
+	struct address_space *mapping;
+	struct page *page;
+	pgoff_t n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int ocfs2_reflink_compare_extents(struct inode *src,
+					 loff_t srcoff,
+					 struct inode *dest,
+					 loff_t destoff,
+					 loff_t len,
+					 bool *is_same)
+{
+	loff_t src_poff;
+	loff_t dest_poff;
+	void *src_addr;
+	void *dest_addr;
+	struct page *src_page;
+	struct page *dest_page;
+	loff_t cmp_len;
+	bool same;
+	int error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		if (cmp_len <= 0) {
+			mlog_errno(-EUCLEAN);
+			goto out_error;
+		}
+
+		src_page = ocfs2_reflink_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = ocfs2_reflink_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Link a range of blocks from one file to another. */
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
+	struct buffer_head *in_bh = NULL, *out_bh = NULL;
+	loff_t bs = 1 << OCFS2_SB(inode_in->i_sb)->s_clustersize_bits;
+	bool same_inode = (inode_in == inode_out);
+	bool is_same = false;
+	loff_t isize;
+	ssize_t ret;
+	loff_t blen;
+
+	if (!ocfs2_refcount_tree(osb))
+		return -EOPNOTSUPP;
+	if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+		return -EROFS;
+
+	/* Lock both files against IO */
+	ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
+	if (ret)
+		return ret;
+
+	ret = -EINVAL;
+	if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
+	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
+		goto out_unlock;
+
+	/* Don't touch certain kinds of inodes */
+	ret = -EPERM;
+	if (IS_IMMUTABLE(inode_out))
+		goto out_unlock;
+
+	ret = -ETXTBSY;
+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+		goto out_unlock;
+
+	/* Don't reflink dirs, pipes, sockets... */
+	ret = -EISDIR;
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		goto out_unlock;
+	ret = -EINVAL;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		goto out_unlock;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		goto out_unlock;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	if (len == 0)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		goto out_unlock;
+
+	/* Don't allow dedupe past EOF in the dest file */
+	if (is_dedupe) {
+		loff_t	disize;
+
+		disize = i_size_read(inode_out);
+		if (pos_out >= disize || pos_out + len > disize)
+			goto out_unlock;
+	}
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		goto out_unlock;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode) {
+		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
+			goto out_unlock;
+	}
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode_in);
+	if (!same_inode)
+		inode_dio_wait(inode_out);
+
+	ret = filemap_write_and_wait_range(inode_in->i_mapping,
+			pos_in, pos_in + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	ret = filemap_write_and_wait_range(inode_out->i_mapping,
+			pos_out, pos_out + len - 1);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Check that the extents are the same.
+	 */
+	if (is_dedupe) {
+		ret = ocfs2_reflink_compare_extents(inode_in, pos_in,
+						    inode_out, pos_out,
+						    len, &is_same);
+		if (ret)
+			goto out_unlock;
+		if (!is_same) {
+			ret = -EBADE;
+			goto out_unlock;
+		}
+	}
+
+	/* Lock out changes to the allocation maps */
+	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
+				  SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Invalidate the page cache so that we can clear any CoW mappings
+	 * in the destination file.
+	 */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
+
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
+
+	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+	if (!same_inode)
+		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	/*
+	 * Empty the extent map so that we may get the right extent
+	 * record from the disk.
+	 */
+	ocfs2_extent_map_trunc(inode_in, 0);
+	ocfs2_extent_map_trunc(inode_out, 0);
+
+	ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
+	if (ret) {
+		mlog_errno(ret);
+		goto out_unlock;
+	}
+
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return 0;
+
+out_unlock:
+	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+	return ret;
+}
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 6422bbc..4af55bf 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -115,4 +115,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
+int ocfs2_reflink_remap_range(struct file *file_in,
+			      loff_t pos_in,
+			      struct file *file_out,
+			      loff_t pos_out,
+			      u64 len,
+			      bool is_dedupe);
+
 #endif /* OCFS2_REFCOUNTTREE_H */
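
The range checks in ocfs2_reflink_remap_range() above are easier to follow
outside the kernel. What follows is a minimal user-space sketch of that
validation order, not the patch's code: check_remap_range(), the remap_verdict
enum, and the SKETCH_* macros are hypothetical names, plain uint64_t stands in
for loff_t/u64, bs is assumed to be a power of two, and the locking, writeback,
and dedupe comparison steps are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's ALIGN() and IS_ALIGNED(). */
#define SKETCH_ALIGN(x, bs)      (((x) + (bs) - 1) & ~((uint64_t)(bs) - 1))
#define SKETCH_IS_ALIGNED(x, bs) (((x) & ((uint64_t)(bs) - 1)) == 0)

enum remap_verdict {
	REMAP_INVALID = -1,	/* the patch bails out via out_unlock */
	REMAP_NOTHING = 0,	/* empty source file: succeed without work */
	REMAP_OK = 1,		/* range may be handed to the remap step */
};

/*
 * Mirror the order of checks in the patch. isize is the source file size,
 * disize the destination size (only consulted for dedupe), bs the block size.
 */
enum remap_verdict check_remap_range(uint64_t pos_in, uint64_t pos_out,
				     uint64_t len, uint64_t isize,
				     uint64_t disize, uint64_t bs,
				     bool same_inode, bool is_dedupe)
{
	uint64_t blen;

	assert(bs != 0 && (bs & (bs - 1)) == 0);

	if (isize == 0)
		return REMAP_NOTHING;

	/* A zero length means "from pos_in through EOF". */
	if (len == 0)
		len = isize - pos_in;

	/* Offsets must not wrap and the source range must sit inside isize. */
	if (pos_in + len < pos_in || pos_out + len < pos_out ||
	    pos_in + len > isize)
		return REMAP_INVALID;

	/* Dedupe may not extend past EOF in the destination. */
	if (is_dedupe && (pos_out >= disize || pos_out + len > disize))
		return REMAP_INVALID;

	/* When the range ends at EOF, round it up to the block boundary. */
	if (pos_in + len == isize)
		blen = SKETCH_ALIGN(isize, bs) - pos_in;
	else
		blen = len;

	/* Both ends of both ranges must be block aligned. */
	if (!SKETCH_IS_ALIGNED(pos_in, bs) ||
	    !SKETCH_IS_ALIGNED(pos_in + blen, bs) ||
	    !SKETCH_IS_ALIGNED(pos_out, bs) ||
	    !SKETCH_IS_ALIGNED(pos_out + blen, bs))
		return REMAP_INVALID;

	/* Within one file the source and destination must not overlap. */
	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
		return REMAP_INVALID;

	return REMAP_OK;
}
```

With a 4096-byte block size, cloning all of a 5000-byte file (len == 0) rounds
the range up to 8192 bytes, so the partial tail block passes the alignment
test. The same request aimed at offset 4096 of the same file is rejected
because the rounded range overlaps its own destination.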

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/6] ocfs2: wire up {clone,copy,dedupe}_range
  2016-11-11  3:15   ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Eric Ren
@ 2016-11-11 15:05     ` Darrick J. Wong
  -1 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-11 15:05 UTC (permalink / raw)
  To: Eric Ren; +Cc: mfasheh, jlbec, linux-fsdevel, ocfs2-devel

On Fri, Nov 11, 2016 at 11:15:57AM +0800, Eric Ren wrote:
> Hi,
> 
> On 11/10/2016 06:51 AM, Darrick J. Wong wrote:
> >Hi all,
> >
> >These patches wire up the existing ocfs2 reflinking capabilities to
> >the new(ish) VFS {copy,clone,dedupe}_range interface.  The first few
> >patches clean up some minor bugs that I found; the last kernel patch
> >contains the new code.
> >
> >A few minor fixes to xfstests are needed to make more of the tests
> >run.  I'll tack that patch on the end.
> 
> FYI, the reflink test cases from ocfs2-test all passed with your patches on
> both single-node and multi-node setups. At the very least, that shows no
> obvious regression so far ;-)

Heh, good. :)

The v2 patch contains some fixes for a few things I thought of last night
that don't have xfstests yet.

I /think/ the locking is ok, but that could use some review. :)

--D

> 
> Eric
> >
> >--D
> >
> >[1] https://github.com/djwong/linux/tree/ocfs2-vfs-reflink
> >
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-11-11 15:05 UTC | newest]

Thread overview: 42+ messages
2016-11-09 22:51 [PATCH 0/6] ocfs2: wire up {clone,copy,dedupe}_range Darrick J. Wong
2016-11-09 22:51 ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
2016-11-09 22:51 ` [PATCH 1/6] ocfs2: convert inode refcount test to a helper Darrick J. Wong
2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-10  2:14   ` Eric Ren
2016-11-10  2:14     ` [Ocfs2-devel] " Eric Ren
2016-11-10 17:51     ` Darrick J. Wong
2016-11-10 17:51       ` [Ocfs2-devel] " Darrick J. Wong
2016-11-10 17:52   ` [PATCH v2 " Darrick J. Wong
2016-11-10 17:52     ` [Ocfs2-devel] " Darrick J. Wong
2016-11-09 22:51 ` [PATCH 2/6] ocfs2: add newlines to some error messages Darrick J. Wong
2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-09 22:51 ` [PATCH 3/6] ocfs2: prohibit refcounted swapfiles Darrick J. Wong
2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-09 22:51 ` [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag Darrick J. Wong
2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-10  9:20   ` Darwin
2016-11-10  9:20     ` Darwin
2016-11-10 17:11     ` Darrick J. Wong
2016-11-10 17:11       ` Darrick J. Wong
2016-11-11  3:00       ` Darwin
2016-11-11  3:00         ` Darwin
2016-11-09 22:51 ` [PATCH 5/6] ocfs2: don't eat io errors during _dio_end_io_write Darrick J. Wong
2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-09 22:51 ` [PATCH 6/6] ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features Darrick J. Wong
2016-11-09 22:51   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-11  5:49   ` Eric Ren
2016-11-11  5:49     ` Eric Ren
2016-11-11  6:20     ` Darrick J. Wong
2016-11-11  6:20       ` Darrick J. Wong
2016-11-11  6:45       ` Eric Ren
2016-11-11  6:45         ` Eric Ren
2016-11-11  9:01         ` Darrick J. Wong
2016-11-11  9:01           ` Darrick J. Wong
2016-11-11 14:54   ` [PATCH v2 " Darrick J. Wong
2016-11-11 14:54     ` [Ocfs2-devel] " Darrick J. Wong
2016-11-09 23:00 ` [PATCH 7/6] xfstests: fix some minor problems testing ocfs2 Darrick J. Wong
2016-11-09 23:00   ` [Ocfs2-devel] " Darrick J. Wong
2016-11-11  3:15 ` [PATCH 0/6] ocfs2: wire up {clone,copy,dedupe}_range Eric Ren
2016-11-11  3:15   ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Eric Ren
2016-11-11 15:05   ` [PATCH 0/6] ocfs2: wire up {clone,copy,dedupe}_range Darrick J. Wong
2016-11-11 15:05     ` [Ocfs2-devel] [PATCH 0/6] ocfs2: wire up {clone, copy, dedupe}_range Darrick J. Wong
