From: Goldwyn Rodrigues <rgoldwyn@suse.de>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 11/15] fs: dedup file range to use a compare function
Date: Mon, 1 Apr 2019 15:36:04 -0500
Message-ID: <20190401203604.w4xzvhb2vklxxrao@merlin>
In-Reply-To: <20190328170440.GH1172@magnolia>

On 10:04 28/03, Darrick J. Wong wrote:
> On Tue, Mar 26, 2019 at 02:02:57PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > With dax we cannot deal with readpage() etc. So, we create a
> > function callback to perform the file data comparison and pass
> > it to generic_remap_file_range_prep() so it can use iomap-based
> > functions.
> 
> So it occurs to me -- is there an intrinsic requirement that all files
> sharing blocks must be in DAX mode or not-DAX mode?  Or in other words,
> if I run cp --reflink a b, then it cannot be the case that IS_DAX(a) !=
> IS_DAX(b), right?
> 
> Will there be calamitous consequences if dax and page cache both point
> to the same piece of storage?  I would imagine so...

Yes, data corruption!

> 
> > This may not be the best way to solve this. Suggestions welcome.
> 
> Agree.  There's a fair amount of code duplication between the dax and
> pagecache compare functions, considering that the loop structure for
> both cases is:
> 
> while (len) {
> 	/* Figure out minimum comparison length */
> 
> 	/* Map source file range into memory */
> 
> 	/* Map dest file range into memory */
> 
> 	/* Compare memory */
> 
> 	/* Release dest file memory */
> 
> 	/* Release source file memory */
> 
> 	/* Update counters or break out */
> }
> 
> The current vfs_dedupe_file_range_compare could use some improvements,
> at least for filesystems where we actually support iomap.  In
> particular, if you try to dedupe into or out of a hole, it'll flood the
> page cache with zero pages, which is pretty wasteful if we could've
> found out that it's actually a hole.
> 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/btrfs/ctree.h     |  3 +++
> >  fs/btrfs/dax.c       |  7 +++++++
> >  fs/btrfs/ioctl.c     | 13 +++++++++++-
> >  fs/dax.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/ocfs2/file.c      |  2 +-
> >  fs/read_write.c      |  9 +++++---
> >  fs/xfs/xfs_reflink.c |  2 +-
> >  include/linux/dax.h  |  2 ++
> >  include/linux/fs.h   |  4 +++-
> >  9 files changed, 93 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 0e5060933bde..750f9c70fabe 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3803,6 +3803,9 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
> >  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> >  vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> > +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> > +		struct inode *dest, loff_t destoff, loff_t len,
> > +		bool *is_same);
> >  #else
> >  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> >  {
> > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> > index 927f962d1e88..9488cae0f8b4 100644
> > --- a/fs/btrfs/dax.c
> > +++ b/fs/btrfs/dax.c
> > @@ -168,4 +168,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
> >  
> >  	return ret;
> >  }
> > +
> > +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> > +		struct inode *dest, loff_t destoff, loff_t len,
> > +		bool *is_same)
> > +{
> > +	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
> > +}
> >  #endif /* CONFIG_FS_DAX */
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index e66426e7692d..2e5137b01561 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3990,8 +3990,19 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  	if (ret < 0)
> >  		goto out_unlock;
> >  
> > +#ifdef CONFIG_FS_DAX
> > +	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
> > +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> > +				pos_out, len, remap_flags,
> > +				btrfs_dax_file_range_compare);
> > +	else
> > +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> > +				pos_out, len, remap_flags, NULL);
> > +#else
> >  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> > -					    len, remap_flags);
> > +						    len, remap_flags, NULL);
> > +#endif
> > +
> >  	if (ret < 0 || *len == 0)
> >  		goto out_unlock;
> >  
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 41061da42771..18998c5ee27a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1775,3 +1775,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> >  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > +
> > +
> > +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> > +		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
> 
> Comments for this function assume this is a vfs generic function, not a
> dax specific thing...
> 
> > +{
> > +	void *saddr, *daddr;
> > +	struct iomap s_iomap = {0};
> > +	struct iomap d_iomap = {0};
> > +	loff_t dstart, sstart;
> > +	bool same = true;
> > +	loff_t cmp_len, l;
> > +	int id, ret = 0;
> > +
> > +	id = dax_read_lock();
> > +	while (len) {
> > +		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
> > +		if (ret < 0) {
> > +			if (ops->iomap_end)
> > +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> > +			return ret;
> > +		}
> 
> Shouldn't we be checking that the iomap actually points to written
> storage?  As opposed to hole/delalloc/unwritten?
> 
> Bonus: not-written extents (hole/unwritten) could be optimized a bit,
> though I don't know if it's worth it for something that's probably a
> corner case.

Yes, I think this needs to be optimized, but from the VFS API end.
As you mention above, the current compare fills the page cache
unnecessarily. I feel file comparison is best handled by individual
filesystems rather than by the VFS.
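
For illustration only (not part of the patch): a minimal userspace sketch
of a comparison that looks at the extent type before touching any data,
so that holes and unwritten extents never need to be read in. All toy_*
names are hypothetical; the three types loosely model the hole, unwritten
and mapped states discussed above, and delalloc is deliberately left out
because its data lives only in memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the extent states discussed in the review. */
enum toy_type { TOY_HOLE, TOY_UNWRITTEN, TOY_MAPPED };

struct toy_map {
	enum toy_type type;
	const unsigned char *addr;	/* only valid for TOY_MAPPED */
};

/* Holes and unwritten extents both read back as zeroes. */
static bool toy_reads_as_zero(const struct toy_map *m)
{
	return m->type != TOY_MAPPED;
}

static bool toy_is_zero(const unsigned char *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (p[i])
			return false;
	return true;
}

/* Compare len bytes of two mappings without reading not-written ranges. */
static bool toy_range_same(const struct toy_map *s, const struct toy_map *d,
			   size_t len)
{
	bool s_zero = toy_reads_as_zero(s);
	bool d_zero = toy_reads_as_zero(d);

	if (s_zero && d_zero)
		return true;		/* zeroes vs. zeroes, no data access */
	if (s_zero)
		return toy_is_zero(d->addr, len);
	if (d_zero)
		return toy_is_zero(s->addr, len);
	return memcmp(s->addr, d->addr, len) == 0;
}
```

The point of the sketch is only the ordering: classify first, read data
second. Whether the extra branches pay off for what is probably a corner
case is the open question raised above.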

> 
> > +		cmp_len = len;
> > +		if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
> > +			cmp_len = s_iomap.offset + s_iomap.length - srcoff;
> > +
> > +		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
> > +		if (ret < 0) {
> > +			if (ops->iomap_end) {
> > +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> > +				ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
> > +			}
> > +			return ret;
> > +		}
> > +		if (cmp_len > d_iomap.offset + d_iomap.length - destoff)
> > +			cmp_len = d_iomap.offset + d_iomap.length - destoff;
> > +
> 
> If you wanted to make this a generic function, it would be kinda nice if
> we could switch between grabbing page cache for non-DAX files and
> dax_direct_access for DAX files.
> 
> (Actually, no, that's kind of ugly too.)
> 
> Perhaps iomap.c needs to grow a function to iterate ranges of two files
> and call an actor function on both sets?  And we can change the actor to
> be the pagecache-grabbing one or the dax-access one depending on the DAX
> mode.
> 

Yes, that would work well.
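
To make the idea concrete, here is a rough userspace sketch (not kernel
code, and not what iomap.c provides today) of a helper that walks the
same range of two files and calls an actor on each chunk that fits inside
one extent of both. The toy_* types and functions are hypothetical; a
real version would use iomap_begin/iomap_end and either the page cache or
dax_direct_access() inside the actor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an iomap: file range [offset, offset + length). */
struct toy_extent {
	size_t offset;
	size_t length;
	const unsigned char *data;	/* memory backing the extent */
};

struct toy_file {
	const struct toy_extent *extents;
	size_t nr_extents;
};

/* Actor: return <0 on error, 0 to stop early, >0 to continue. */
typedef int (*toy_actor_t)(const unsigned char *s, const unsigned char *d,
			   size_t len, void *priv);

static const struct toy_extent *toy_lookup(const struct toy_file *f, size_t pos)
{
	for (size_t i = 0; i < f->nr_extents; i++) {
		const struct toy_extent *e = &f->extents[i];

		if (pos >= e->offset && pos - e->offset < e->length)
			return e;
	}
	return NULL;
}

static size_t toy_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Iterate both ranges in lockstep, clamping each step to both extents. */
static int toy_apply2(const struct toy_file *src, size_t srcoff,
		      const struct toy_file *dest, size_t destoff,
		      size_t len, toy_actor_t actor, void *priv)
{
	assert(actor != NULL);
	while (len) {
		const struct toy_extent *s = toy_lookup(src, srcoff);
		const struct toy_extent *d = toy_lookup(dest, destoff);
		size_t n;
		int ret;

		if (!s || !d)
			return -1;	/* range is not mapped in one file */
		n = toy_min(len, s->offset + s->length - srcoff);
		n = toy_min(n, d->offset + d->length - destoff);

		ret = actor(s->data + (srcoff - s->offset),
			    d->data + (destoff - d->offset), n, priv);
		if (ret <= 0)
			return ret;
		srcoff += n;
		destoff += n;
		len -= n;
	}
	return 0;
}

/* Example actor: stop at the first chunk that differs. */
static int toy_compare_actor(const unsigned char *s, const unsigned char *d,
			     size_t len, void *priv)
{
	bool *same = priv;

	if (memcmp(s, d, len) != 0) {
		*same = false;
		return 0;
	}
	return 1;
}
```

With this shape, the loop and the length clamping exist once, and only
the actor would differ between the page cache and DAX cases.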

> Though I guess that leaves poor old ocfs2 in the dust since it doesn't
> support iomap.  I guess it wouldn't be too hard to adapt it to have an
> iomap ops for reads only.
> 
> > +
> > +		sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset);
> > +		l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL);
> > +		dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset);
> > +		l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL);
> > +		same = !memcmp(saddr, daddr, cmp_len);
> > +		if (!same)
> > +			break;
> > +		len -= cmp_len;
> > +		srcoff += cmp_len;
> > +		destoff += cmp_len;
> > +
> > +		if (ops->iomap_end) {
> > +			ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
> > +			ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
> > +		}
> > +	}
> > +	dax_read_unlock(id);
> > +	*is_same = same;
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_file_range_compare);
> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> > index d640c5f8a85d..6bf3e8fbb016 100644
> > --- a/fs/ocfs2/file.c
> > +++ b/fs/ocfs2/file.c
> > @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
> >  		goto out_unlock;
> >  
> >  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> > -			&len, remap_flags);
> > +			&len, remap_flags, NULL);
> >  	if (ret < 0 || len == 0)
> >  		goto out_unlock;
> >  
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 177ccc3d405a..da521a221213 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1855,7 +1855,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> >   */
> >  int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  				  struct file *file_out, loff_t pos_out,
> > -				  loff_t *len, unsigned int remap_flags)
> > +				  loff_t *len, unsigned int remap_flags, compare_range_t compare)
> 
> Line wrapping...
> 
> >  {
> >  	struct inode *inode_in = file_inode(file_in);
> >  	struct inode *inode_out = file_inode(file_out);
> > @@ -1914,8 +1914,11 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  	 */
> >  	if (remap_flags & REMAP_FILE_DEDUP) {
> >  		bool		is_same = false;
> > -
> > -		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> > +		if (compare)
> > +			ret = compare(inode_in, pos_in,
> > +				inode_out, pos_out, *len, &is_same);
> > +		else
> > +			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> >  				inode_out, pos_out, *len, &is_same);
> 
> Make the callers pass in vfs_dedupe_file_range_compare instead of NULL,
> and avoid this if clause stuff.

Got it. Thanks.
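
As a sketch of what that change amounts to (userspace C with hypothetical
toy_* names, not the actual next revision): every caller names a compare
function, so the prep helper calls it unconditionally and the NULL
fallback branch goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of compare_range_t, operating on plain buffers. */
typedef int (*toy_compare_t)(const void *src, const void *dest, size_t len,
			     bool *is_same);

/* Stand-in for the default (page cache) comparison. */
static int toy_default_compare(const void *src, const void *dest, size_t len,
			       bool *is_same)
{
	*is_same = memcmp(src, dest, len) == 0;
	return 0;
}

/* Stand-in for a filesystem-specific (e.g. DAX) comparison. */
static int toy_dax_calls;

static int toy_dax_compare(const void *src, const void *dest, size_t len,
			   bool *is_same)
{
	toy_dax_calls++;
	*is_same = memcmp(src, dest, len) == 0;
	return 0;
}

/* No "if (compare) ... else ..." here: the caller always supplies one. */
static int toy_remap_prep(const void *src, const void *dest, size_t len,
			  toy_compare_t compare, bool *is_same)
{
	assert(compare != NULL);
	return compare(src, dest, len, is_same);
}
```

A caller that does not care about DAX would pass toy_default_compare,
which corresponds to ocfs2 and xfs passing vfs_dedupe_file_range_compare
rather than NULL.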

> 
> --D
> 
> >  		if (ret)
> >  			return ret;
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 680ae7662a78..8907c7aa3f19 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
> >  		goto out_unlock;
> >  
> >  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> > -			len, remap_flags);
> > +			len, remap_flags, NULL);
> >  	if (ret < 0 || *len == 0)
> >  		goto out_unlock;
> >  
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 0dd316a74a29..a11bc7b1f526 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> >  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> >  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> >  				      pgoff_t index);
> > +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> > +                loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
> >  
> >  #ifdef CONFIG_FS_DAX
> >  int __dax_zero_page_range(struct block_device *bdev,
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8b42df09b04c..22fe4324b22e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1880,10 +1880,12 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> >  		unsigned long, loff_t *, rwf_t);
> >  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> >  				   loff_t, size_t, unsigned int);
> > +typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest,
> > +		loff_t destpos, loff_t len, bool *is_same);
> >  extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  					 struct file *file_out, loff_t pos_out,
> >  					 loff_t *count,
> > -					 unsigned int remap_flags);
> > +					 unsigned int remap_flags, compare_range_t cmp);
> >  extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> >  				  struct file *file_out, loff_t pos_out,
> >  				  loff_t len, unsigned int remap_flags);
> > -- 
> > 2.16.4
> > 
> 

-- 
Goldwyn
