From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andreas Dilger
Subject: Re: [PATCH 1/3] fs: Document the reflink(2) system call.
Date: Tue, 05 May 2009 15:24:17 -0600
Message-ID: <20090505212417.GO3209@webber.adilger.int>
References: <1241331303-23753-1-git-send-email-joel.becker@oracle.com>
 <1241331303-23753-2-git-send-email-joel.becker@oracle.com>
 <20090505010703.GA12731@shareable.org>
 <20090505071608.GB10258@mail.oracle.com>
 <20090505080936.GG3209@webber.adilger.int>
 <20090505165628.GC7835@mail.oracle.com>
In-reply-to: <20090505165628.GC7835@mail.oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7BIT
To: Jamie Lokier, linux-fsdevel@vger.kernel.org, jmorris@namei.org,
 ocfs2-devel@oss.oracle.com, viro@zeniv.linux.org.uk
Sender: linux-fsdevel-owner@vger.kernel.org

On May 05, 2009 09:56 -0700, Joel Becker wrote:
> On Tue, May 05, 2009 at 02:09:36AM -0600, Andreas Dilger wrote:
> > If the reflink caller is always charged for the full space used (as if
> > it were a real copy) by virtue of the user doing the reflink() owning the
> > new inode. Doing anything else seems broken. If the owner of the file
> > wasn't charged for the reflink's quota then if the reflink inode was
> > chowned the new owner would be charged for the new file, but the quota
> > code would have to special case the decrement of EACH of the reflink's
> > blocks because otherwise the original owner might "release" quota that
> > it was never originally charged.
>
> If the caller is creating an inode in someone else's name, then
> who do you charge for the quota?

IMHO, it shouldn't be possible to create an inode in someone else's name
(CAP_* excluded), just like it isn't possible to create a new file in
someone else's name. The caller of reflink() should be the one creating
the file, hence the owner of the file, and the owner of the quota.

> If you charge the caller, how do you know to decrement the caller's
> quota when the actual owner does truncate, given that the inode has
> no knowledge of the caller anymore.

No, if the owner of the inode (== caller) is charged the quota then when
the inode is truncated (regardless of who does the truncate) the quota
will just work correctly.

> You've hit the nail on the head - without backrefs for each
> refcounted hunk, you can't figure out who it owns it from a quota
> perspective. And that's just a non-starter to try and maintain.

No, I don't think my proposal is _more_ complex than the original. It is
actually _less_ complex, because the fact that this is a reflink and not
a complete file copy is a purely internal detail of the filesystem and is
not exposed outside the filesystem. The fact that a reflink consumes less
space and is faster than a real copy is an implementation detail, not
really any different than if the file were compressed by the filesystem
internally.

> > > Here's another fun trick. Overwriting rsync, instead of copying
> > > blocks from the already-existing source could reflink the source to the
> > > .temporary, then only write the changed blocks. And since you own both
> > > files, it just works. If you're overwriting someone else's file? The
> > > old copy behavior is fine.
> >
> > Well, "fine" as in it works, but if there are only a few changed blocks,
> > and the old copy is now part of a snapshot (so it won't be released when
> > rsync is finished) the space consumption has doubled instead of just
> > using a few extra blocks.
>
> No, because the last thing rsync will do is rename(.temporary,
> source). All the references from the source will be decremented, and
> any blocks only owned by the source will be freed. Space usage is
> identical before and after, like a copying rsync, but there is less
> space used and less I/O done during the rsync process.

What I was objecting to is "when overwriting someone else's file, the old
copy behaviour is fine". If we are implementing a copy-on-write API, why
hamstring it so that it does not work in the manner expected of a normal
"cp"?

> > Is there anything about changing the owner/group of the new inode during
> > reflink that makes the implementation more complex? If the process doing
> > the reflink is the same as the file owner then the semantics are unchanged
> > from what you have proposed.
>
> If you define that 'reflink sets the attributes as if it was a
> new file', then you should be creating the file with a new security
> context, not with the security context from the existing inode. And
> then you can't really snapshot.
> A mixed behavior, like "if you own it, I'll preserve the entire
> security context, but if not I will treat it with a new context" is
> confusing at best.

I don't find it confusing. The security context would be inherited from
the creating process, just as it would be when creating a new file. If it
is the same user as the file owner then the security context will be the
same.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
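
The accounting rule argued for above (the caller of reflink() owns the new
inode and is charged for its full size, as if it were a real copy) can be
sketched as a toy model in Python. This is only an illustration under
stated assumptions, not kernel or filesystem code: the Inode and Quota
classes and their create(), reflink(), truncate() and chown() methods are
hypothetical names invented here, and on-disk block sharing is deliberately
left out because, under this rule, it never affects the accounting.

```python
from collections import defaultdict


class Inode:
    """A file as the quota code sees it: an owner and a block count."""

    def __init__(self, owner, blocks):
        self.owner = owner
        self.blocks = blocks


class Quota:
    """Toy per-user block accounting (hypothetical, for illustration only)."""

    def __init__(self):
        self.used = defaultdict(int)

    def create(self, owner, blocks):
        inode = Inode(owner, blocks)
        self.used[owner] += blocks
        return inode

    def reflink(self, src, caller):
        # The caller owns the new inode and is charged as if this were a
        # full copy; whether blocks are shared on disk is invisible here.
        return self.create(caller, src.blocks)

    def truncate(self, inode, blocks):
        # Whoever calls truncate, the inode's owner gets the credit.
        self.used[inode.owner] -= inode.blocks - blocks
        inode.blocks = blocks

    def chown(self, inode, new_owner):
        # Move the whole charge; no per-block special cases are needed.
        self.used[inode.owner] -= inode.blocks
        self.used[new_owner] += inode.blocks
        inode.owner = new_owner


quota = Quota()
original = quota.create("alice", 100)
link = quota.reflink(original, caller="bob")  # bob is charged 100 blocks
quota.truncate(link, 40)                      # bob drops to 40
quota.chown(link, "carol")                    # the charge moves to carol
print(dict(quota.used))  # {'alice': 100, 'bob': 0, 'carol': 40}
```

In this model the original owner's usage never changes, and truncate and
chown need only the inode's current owner and block count, which is the
sense in which no per-block back references are required.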