All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Eric Biggers <ebiggers3@gmail.com>
Cc: mtk.manpages@gmail.com, david@fromorbit.com,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	xfs@oss.sgi.com, linux-btrfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
Date: Mon, 8 Aug 2016 03:47:34 +1000	[thread overview]
Message-ID: <594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com> (raw)
In-Reply-To: <20160112091432.GB7832@birch.djwong.org>

Hi Darrick,

On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]

In the discussion below, many points of possible improvement were notedfor
the man page.... Would you be willing to put together a patch please?

Thanks,

Michael

  
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes about the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary.  There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does".  I suspect that will leave my explanations lackng, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove.  Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
>     __s64 src_fd;
>     __u64 src_offset;
>     __u64 length;
>     __u64 dest_offset;
>     __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd.  Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be superset of the ones we know about today.  I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files.  I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic).  But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not?  The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors.  So why is there some non-symmetric
>> behavior?  There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination.  If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete.  Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible?  Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedup_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors?  The man
>> page implies they must be open for writing and not appending.  The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps exist at all for this
>> ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd.  I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(
>
>> If short counts are allowed, what will be the 'status' be in that case:
>> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in.  And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

WARNING: multiple messages have this Message-ID (diff)
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Eric Biggers <ebiggers3@gmail.com>
Cc: linux-api@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org,
	mtk.manpages@gmail.com, linux-fsdevel@vger.kernel.org,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
Date: Mon, 8 Aug 2016 03:47:34 +1000	[thread overview]
Message-ID: <594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com> (raw)
In-Reply-To: <20160112091432.GB7832@birch.djwong.org>

Hi Darrick,

On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]

In the discussion below, many points of possible improvement were notedfor
the man page.... Would you be willing to put together a patch please?

Thanks,

Michael

  
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes about the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary.  There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does".  I suspect that will leave my explanations lackng, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove.  Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
>     __s64 src_fd;
>     __u64 src_offset;
>     __u64 length;
>     __u64 dest_offset;
>     __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd.  Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be superset of the ones we know about today.  I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files.  I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic).  But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not?  The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors.  So why is there some non-symmetric
>> behavior?  There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination.  If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete.  Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible?  Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedup_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors?  The man
>> page implies they must be open for writing and not appending.  The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps exist at all for this
>> ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd.  I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(
>
>> If short counts are allowed, what will be the 'status' be in that case:
>> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in.  And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2016-08-07 17:47 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-19  8:55 [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Darrick J. Wong
2015-12-19  8:55 ` Darrick J. Wong
2015-12-19  8:55 ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 2/9] x86: add sys_copy_file_range to syscall tables Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 3/9] btrfs: add .copy_file_range file operation Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 5/9] locks: new locks_mandatory_area calling convention Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 6/9] vfs: pull btrfs clone API to vfs layer Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2016-01-12  6:07   ` Eric Biggers
2016-01-12  6:07     ` Eric Biggers
2016-01-12  9:14     ` Darrick J. Wong
2016-01-12  9:14       ` Darrick J. Wong
2016-01-13  2:36       ` Eric Biggers
2016-01-13  2:36         ` Eric Biggers
2016-01-13  2:36         ` Eric Biggers
2016-01-23  0:54         ` Darrick J. Wong
2016-01-23  0:54           ` Darrick J. Wong
2016-01-23  0:54           ` Darrick J. Wong
2016-08-07 17:47       ` Michael Kerrisk (man-pages) [this message]
2016-08-07 17:47         ` Michael Kerrisk (man-pages)
2016-07-27 21:51   ` Kirill A. Shutemov
2016-07-27 21:51     ` Kirill A. Shutemov
2016-07-28 18:07     ` Darrick J. Wong
2016-07-28 18:07       ` Darrick J. Wong
2016-07-28 18:07       ` Darrick J. Wong
2016-07-28 19:25     ` Darrick J. Wong
2016-07-28 19:25       ` Darrick J. Wong
2015-12-19  8:56 ` [PATCH 9/9] btrfs: use new dedupe data function pointer Darrick J. Wong
2015-12-19  8:56   ` Darrick J. Wong
2015-12-20 15:30 ` [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Christoph Hellwig
2015-12-20 15:30   ` Christoph Hellwig
2015-12-20 15:30   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=594f4812-e146-6a6b-60d5-91ac56d46d66@gmail.com \
    --to=mtk.manpages@gmail.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=ebiggers3@gmail.com \
    --cc=hch@infradead.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.