All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] fs: add AT_REPLACE flag for linkat()
@ 2016-11-22  8:25 Omar Sandoval
  2016-11-22  8:25   ` Omar Sandoval
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-api, kernel-team, Xi Wang

From: Omar Sandoval <osandov@fb.com>

This is a proof-of-concept patch series implementing an AT_REPLACE flag
for linkat(2) which allows us to replace the target. This is a nice
primitive on its own, but it's most interesting when combined with
O_TMPFILE, as it allows you to do an atomic update of a file with an
O_TMPFILE.

Patch 1 implements the VFS support for this flag. The implementation
resembles sys_renameat2(), and I took care to preserve all of the
original error cases and make the new error cases consistent with
rename.

Patch 2 adds a dcache helper for filesystem implementations of
AT_REPLACE. I'm not entirely convinced that it's 100% correct.

Patch 3 adds support for AT_REPLACE to Btrfs. That's the codebase I'm
most familiar with so that's where I started, but it should be
straightforward to implement for other filesystems.

Any comments are welcome.

Cc: Xi Wang <xi@cs.washington.edu>

Omar Sandoval (3):
  fs: add AT_REPLACE flag for linkat() which replaces the target
  vfs: add d_replace()
  Btrfs: add support for linkat() AT_REPLACE

 fs/btrfs/inode.c           |  59 ++++++++++++++-
 fs/dcache.c                |  68 +++++++++++++++--
 fs/ecryptfs/inode.c        |   2 +-
 fs/namei.c                 | 180 +++++++++++++++++++++++++++++++++++----------
 fs/nfsd/vfs.c              |   2 +-
 fs/overlayfs/overlayfs.h   |   2 +-
 include/linux/dcache.h     |   1 +
 include/linux/fs.h         |   3 +-
 include/uapi/linux/fcntl.h |   1 +
 9 files changed, 267 insertions(+), 51 deletions(-)

-- 
2.10.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/3] fs: add AT_REPLACE flag for linkat() which replaces the target
@ 2016-11-22  8:25   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-api, kernel-team, Xi Wang

From: Omar Sandoval <osandov@fb.com>

One of the most common uses of temporary files is the classic atomic
replacement pattern, i.e.,

- write temporary file
- fsync temporary file
- rename temporary file over real file
- fsync parent directory

Now, we have O_TMPFILE, which gives us a much better way to create
temporary files, but it's not possible to use it for this pattern.

This patch introduces an AT_REPLACE flag which allows linkat() to
replace the target file. Now, the temporary file in the pattern above
can be a proper O_TMPFILE. Even without O_TMPFILE, this is a new
primitive which might be useful in other contexts.

Cc: Xi Wang <xi@cs.washington.edu>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/ecryptfs/inode.c        |   2 +-
 fs/namei.c                 | 180 +++++++++++++++++++++++++++++++++++----------
 fs/nfsd/vfs.c              |   2 +-
 fs/overlayfs/overlayfs.h   |   2 +-
 include/linux/fs.h         |   3 +-
 include/uapi/linux/fcntl.h |   1 +
 6 files changed, 149 insertions(+), 41 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index cf390dc..d38c24a 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -442,7 +442,7 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
 	dget(lower_new_dentry);
 	lower_dir_dentry = lock_parent(lower_new_dentry);
 	rc = vfs_link(lower_old_dentry, d_inode(lower_dir_dentry),
-		      lower_new_dentry, NULL);
+		      lower_new_dentry, NULL, 0);
 	if (rc || d_really_is_negative(lower_new_dentry))
 		goto out_lock;
 	rc = ecryptfs_interpose(lower_new_dentry, new_dentry, dir->i_sb);
diff --git a/fs/namei.c b/fs/namei.c
index 5b4eed2..94681be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4144,16 +4144,26 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn
  * be appropriate for callers that expect the underlying filesystem not
  * to be NFS exported.
  */
-int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry, struct inode **delegated_inode)
+int vfs_link(struct dentry *old_dentry, struct inode *dir,
+	     struct dentry *new_dentry, struct inode **delegated_inode,
+	     unsigned int flags)
 {
 	struct inode *inode = old_dentry->d_inode;
+	struct inode *target = new_dentry->d_inode;
 	unsigned max_links = dir->i_sb->s_max_links;
 	int error;
 
 	if (!inode)
 		return -ENOENT;
 
-	error = may_create(dir, new_dentry);
+	if (target) {
+		if (flags & AT_REPLACE)
+			error = may_delete(dir, new_dentry, d_is_dir(old_dentry));
+		else
+			error = -EEXIST;
+	} else {
+		error = may_create(dir, new_dentry);
+	}
 	if (error)
 		return error;
 
@@ -4172,8 +4182,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (HAS_UNMAPPED_ID(inode))
 		return -EPERM;
-	if (!dir->i_op->link)
+	if (!dir->i_op->link && !dir->i_op->link2)
 		return -EPERM;
+	if (flags && !dir->i_op->link2)
+		return -EINVAL;
 	if (S_ISDIR(inode->i_mode))
 		return -EPERM;
 
@@ -4181,26 +4193,58 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	if (error)
 		return error;
 
-	inode_lock(inode);
+	dget(new_dentry);
+	lock_two_nondirectories(inode, target);
+
+	if (is_local_mountpoint(new_dentry)) {
+		error = -EBUSY;
+		goto out;
+	}
+
 	/* Make sure we don't allow creating hardlink to an unlinked file */
-	if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
+	if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE)) {
 		error =  -ENOENT;
-	else if (max_links && inode->i_nlink >= max_links)
+		goto out;
+	}
+	if (max_links && inode->i_nlink >= max_links) {
 		error = -EMLINK;
-	else {
-		error = try_break_deleg(inode, delegated_inode);
-		if (!error)
-			error = dir->i_op->link(old_dentry, dir, new_dentry);
+		goto out;
+	}
+
+	error = try_break_deleg(inode, delegated_inode);
+	if (error)
+		goto out;
+	if (target) {
+		error = try_break_deleg(target, delegated_inode);
+		if (error)
+			goto out;
+	}
+
+	if (dir->i_op->link)
+		error = dir->i_op->link(old_dentry, dir, new_dentry);
+	else
+		error = dir->i_op->link2(old_dentry, dir, new_dentry, flags);
+	if (error)
+		goto out;
+
+	if (target) {
+		dont_mount(new_dentry);
+		detach_mounts(new_dentry);
 	}
 
-	if (!error && (inode->i_state & I_LINKABLE)) {
+	if (inode->i_state & I_LINKABLE) {
 		spin_lock(&inode->i_lock);
 		inode->i_state &= ~I_LINKABLE;
 		spin_unlock(&inode->i_lock);
 	}
-	inode_unlock(inode);
-	if (!error)
+out:
+	unlock_two_nondirectories(inode, target);
+	dput(new_dentry);
+	if (!error) {
+		if (target)
+			fsnotify_link_count(target);
 		fsnotify_link(dir, inode, new_dentry);
+	}
 	return error;
 }
 EXPORT_SYMBOL(vfs_link);
@@ -4219,11 +4263,15 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 {
 	struct dentry *new_dentry;
 	struct path old_path, new_path;
+	struct qstr new_last;
+	int new_type;
 	struct inode *delegated_inode = NULL;
-	int how = 0;
+	struct filename *to;
+	unsigned int how = 0, target_flags;
+	bool should_retry = false;
 	int error;
 
-	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
+	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_REPLACE)) != 0)
 		return -EINVAL;
 	/*
 	 * To use null names we require CAP_DAC_READ_SEARCH
@@ -4238,44 +4286,102 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 
 	if (flags & AT_SYMLINK_FOLLOW)
 		how |= LOOKUP_FOLLOW;
+
+	if (flags & AT_REPLACE)
+		target_flags = LOOKUP_RENAME_TARGET;
+	else
+		target_flags = LOOKUP_CREATE | LOOKUP_EXCL;
 retry:
 	error = user_path_at(olddfd, oldname, how, &old_path);
 	if (error)
 		return error;
 
-	new_dentry = user_path_create(newdfd, newname, &new_path,
-					(how & LOOKUP_REVAL));
-	error = PTR_ERR(new_dentry);
-	if (IS_ERR(new_dentry))
-		goto out;
+	to = user_path_parent(newdfd, newname, &new_path, &new_last, &new_type,
+			      (how & LOOKUP_REVAL));
+	if (IS_ERR(to)) {
+		error = PTR_ERR(to);
+		goto exit1;
+	}
+
+	if (old_path.mnt != new_path.mnt) {
+		error = -EXDEV;
+		goto exit2;
+	}
+
+	if (new_type != LAST_NORM) {
+		if (flags & AT_REPLACE)
+			error = -EBUSY;
+		else
+			error = -EEXIST;
+		goto exit2;
+	}
+
+	error = mnt_want_write(old_path.mnt);
+	if (error)
+		goto exit2;
+
+retry_deleg:
+	inode_lock_nested(new_path.dentry->d_inode, I_MUTEX_PARENT);
+
+	new_dentry = __lookup_hash(&new_last, new_path.dentry,
+				   (how & LOOKUP_REVAL) | target_flags);
+	if (IS_ERR(new_dentry)) {
+		error = PTR_ERR(new_dentry);
+		goto exit3;
+	}
+	if (!(flags & AT_REPLACE) && d_is_positive(new_dentry)) {
+		error = -EEXIST;
+		goto exit4;
+	}
+	if (new_last.name[new_last.len]) {
+		/* trailing slash on negative dentry gives -ENOENT */
+		if (d_is_negative(new_dentry)) {
+			error = -ENOENT;
+			goto exit4;
+		}
+
+		/*
+		 * unless the source is a directory, trailing slash gives
+		 * -ENOTDIR (this can only happen in the AT_REPLACE case, so we
+		 * make this consistent with sys_renameat2() even though a
+		 * source directory will fail later with -EPERM)
+		 */
+		if (!d_is_dir(old_path.dentry)) {
+			error = -ENOTDIR;
+			goto exit4;
+		}
+	}
 
-	error = -EXDEV;
-	if (old_path.mnt != new_path.mnt)
-		goto out_dput;
 	error = may_linkat(&old_path);
 	if (unlikely(error))
-		goto out_dput;
+		goto exit4;
 	error = security_path_link(old_path.dentry, &new_path, new_dentry);
 	if (error)
-		goto out_dput;
-	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
-out_dput:
-	done_path_create(&new_path, new_dentry);
+		goto exit4;
+	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry,
+			 &delegated_inode, flags & AT_REPLACE);
+exit4:
+	dput(new_dentry);
+exit3:
+	inode_unlock(new_path.dentry->d_inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
-		if (!error) {
-			path_put(&old_path);
-			goto retry;
-		}
+		if (!error)
+			goto retry_deleg;
 	}
-	if (retry_estale(error, how)) {
-		path_put(&old_path);
+	mnt_drop_write(old_path.mnt);
+exit2:
+	if (retry_estale(error, how))
+		should_retry = true;
+	path_put(&new_path);
+	putname(to);
+exit1:
+	path_put(&old_path);
+	if (should_retry) {
+		should_retry = false;
 		how |= LOOKUP_REVAL;
 		goto retry;
 	}
-out:
-	path_put(&old_path);
-
 	return error;
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8ca642f..0a16599 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1594,7 +1594,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 	err = nfserr_noent;
 	if (d_really_is_negative(dold))
 		goto out_dput;
-	host_err = vfs_link(dold, dirp, dnew, NULL);
+	host_err = vfs_link(dold, dirp, dnew, NULL, 0);
 	if (!host_err) {
 		err = nfserrno(commit_metadata(ffhp));
 		if (!err)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index e218e74..71b2f52 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -46,7 +46,7 @@ static inline int ovl_do_unlink(struct inode *dir, struct dentry *dentry)
 static inline int ovl_do_link(struct dentry *old_dentry, struct inode *dir,
 			      struct dentry *new_dentry, bool debug)
 {
-	int err = vfs_link(old_dentry, dir, new_dentry, NULL);
+	int err = vfs_link(old_dentry, dir, new_dentry, NULL, 0);
 	if (debug) {
 		pr_debug("link(%pd2, %pd2) = %i\n",
 			 old_dentry, new_dentry, err);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0478c..2efa6c1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1606,7 +1606,7 @@ extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
 extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
 extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
 extern int vfs_symlink(struct inode *, struct dentry *, const char *);
-extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
+extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
@@ -1743,6 +1743,7 @@ struct inode_operations {
 
 	int (*create) (struct inode *,struct dentry *, umode_t, bool);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
+	int (*link2) (struct dentry *,struct inode *,struct dentry *,unsigned int);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct inode *,struct dentry *,const char *);
 	int (*mkdir) (struct inode *,struct dentry *,umode_t);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index beed138..6c1f293 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -62,6 +62,7 @@
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 #define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
 #define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
+#define AT_REPLACE		0x2000	/* Replace new path */
 
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.10.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/3] fs: add AT_REPLACE flag for linkat() which replaces the target
@ 2016-11-22  8:25   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Xi Wang

From: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>

One of the most common uses of temporary files is the classic atomic
replacement pattern, i.e.,

- write temporary file
- fsync temporary file
- rename temporary file over real file
- fsync parent directory

Now, we have O_TMPFILE, which gives us a much better way to create
temporary files, but it's not possible to use it for this pattern.

This patch introduces an AT_REPLACE flag which allows linkat() to
replace the target file. Now, the temporary file in the pattern above
can be a proper O_TMPFILE. Even without O_TMPFILE, this is a new
primitive which might be useful in other contexts.

Cc: Xi Wang <xi-GmWTxIRN22iJaUV4rX00uodd74u8MsAO@public.gmane.org>
Signed-off-by: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>
---
 fs/ecryptfs/inode.c        |   2 +-
 fs/namei.c                 | 180 +++++++++++++++++++++++++++++++++++----------
 fs/nfsd/vfs.c              |   2 +-
 fs/overlayfs/overlayfs.h   |   2 +-
 include/linux/fs.h         |   3 +-
 include/uapi/linux/fcntl.h |   1 +
 6 files changed, 149 insertions(+), 41 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index cf390dc..d38c24a 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -442,7 +442,7 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
 	dget(lower_new_dentry);
 	lower_dir_dentry = lock_parent(lower_new_dentry);
 	rc = vfs_link(lower_old_dentry, d_inode(lower_dir_dentry),
-		      lower_new_dentry, NULL);
+		      lower_new_dentry, NULL, 0);
 	if (rc || d_really_is_negative(lower_new_dentry))
 		goto out_lock;
 	rc = ecryptfs_interpose(lower_new_dentry, new_dentry, dir->i_sb);
diff --git a/fs/namei.c b/fs/namei.c
index 5b4eed2..94681be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4144,16 +4144,26 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn
  * be appropriate for callers that expect the underlying filesystem not
  * to be NFS exported.
  */
-int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry, struct inode **delegated_inode)
+int vfs_link(struct dentry *old_dentry, struct inode *dir,
+	     struct dentry *new_dentry, struct inode **delegated_inode,
+	     unsigned int flags)
 {
 	struct inode *inode = old_dentry->d_inode;
+	struct inode *target = new_dentry->d_inode;
 	unsigned max_links = dir->i_sb->s_max_links;
 	int error;
 
 	if (!inode)
 		return -ENOENT;
 
-	error = may_create(dir, new_dentry);
+	if (target) {
+		if (flags & AT_REPLACE)
+			error = may_delete(dir, new_dentry, d_is_dir(old_dentry));
+		else
+			error = -EEXIST;
+	} else {
+		error = may_create(dir, new_dentry);
+	}
 	if (error)
 		return error;
 
@@ -4172,8 +4182,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (HAS_UNMAPPED_ID(inode))
 		return -EPERM;
-	if (!dir->i_op->link)
+	if (!dir->i_op->link && !dir->i_op->link2)
 		return -EPERM;
+	if (flags && !dir->i_op->link2)
+		return -EINVAL;
 	if (S_ISDIR(inode->i_mode))
 		return -EPERM;
 
@@ -4181,26 +4193,58 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	if (error)
 		return error;
 
-	inode_lock(inode);
+	dget(new_dentry);
+	lock_two_nondirectories(inode, target);
+
+	if (is_local_mountpoint(new_dentry)) {
+		error = -EBUSY;
+		goto out;
+	}
+
 	/* Make sure we don't allow creating hardlink to an unlinked file */
-	if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
+	if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE)) {
 		error =  -ENOENT;
-	else if (max_links && inode->i_nlink >= max_links)
+		goto out;
+	}
+	if (max_links && inode->i_nlink >= max_links) {
 		error = -EMLINK;
-	else {
-		error = try_break_deleg(inode, delegated_inode);
-		if (!error)
-			error = dir->i_op->link(old_dentry, dir, new_dentry);
+		goto out;
+	}
+
+	error = try_break_deleg(inode, delegated_inode);
+	if (error)
+		goto out;
+	if (target) {
+		error = try_break_deleg(target, delegated_inode);
+		if (error)
+			goto out;
+	}
+
+	if (dir->i_op->link)
+		error = dir->i_op->link(old_dentry, dir, new_dentry);
+	else
+		error = dir->i_op->link2(old_dentry, dir, new_dentry, flags);
+	if (error)
+		goto out;
+
+	if (target) {
+		dont_mount(new_dentry);
+		detach_mounts(new_dentry);
 	}
 
-	if (!error && (inode->i_state & I_LINKABLE)) {
+	if (inode->i_state & I_LINKABLE) {
 		spin_lock(&inode->i_lock);
 		inode->i_state &= ~I_LINKABLE;
 		spin_unlock(&inode->i_lock);
 	}
-	inode_unlock(inode);
-	if (!error)
+out:
+	unlock_two_nondirectories(inode, target);
+	dput(new_dentry);
+	if (!error) {
+		if (target)
+			fsnotify_link_count(target);
 		fsnotify_link(dir, inode, new_dentry);
+	}
 	return error;
 }
 EXPORT_SYMBOL(vfs_link);
@@ -4219,11 +4263,15 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 {
 	struct dentry *new_dentry;
 	struct path old_path, new_path;
+	struct qstr new_last;
+	int new_type;
 	struct inode *delegated_inode = NULL;
-	int how = 0;
+	struct filename *to;
+	unsigned int how = 0, target_flags;
+	bool should_retry = false;
 	int error;
 
-	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
+	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH | AT_REPLACE)) != 0)
 		return -EINVAL;
 	/*
 	 * To use null names we require CAP_DAC_READ_SEARCH
@@ -4238,44 +4286,102 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 
 	if (flags & AT_SYMLINK_FOLLOW)
 		how |= LOOKUP_FOLLOW;
+
+	if (flags & AT_REPLACE)
+		target_flags = LOOKUP_RENAME_TARGET;
+	else
+		target_flags = LOOKUP_CREATE | LOOKUP_EXCL;
 retry:
 	error = user_path_at(olddfd, oldname, how, &old_path);
 	if (error)
 		return error;
 
-	new_dentry = user_path_create(newdfd, newname, &new_path,
-					(how & LOOKUP_REVAL));
-	error = PTR_ERR(new_dentry);
-	if (IS_ERR(new_dentry))
-		goto out;
+	to = user_path_parent(newdfd, newname, &new_path, &new_last, &new_type,
+			      (how & LOOKUP_REVAL));
+	if (IS_ERR(to)) {
+		error = PTR_ERR(to);
+		goto exit1;
+	}
+
+	if (old_path.mnt != new_path.mnt) {
+		error = -EXDEV;
+		goto exit2;
+	}
+
+	if (new_type != LAST_NORM) {
+		if (flags & AT_REPLACE)
+			error = -EBUSY;
+		else
+			error = -EEXIST;
+		goto exit2;
+	}
+
+	error = mnt_want_write(old_path.mnt);
+	if (error)
+		goto exit2;
+
+retry_deleg:
+	inode_lock_nested(new_path.dentry->d_inode, I_MUTEX_PARENT);
+
+	new_dentry = __lookup_hash(&new_last, new_path.dentry,
+				   (how & LOOKUP_REVAL) | target_flags);
+	if (IS_ERR(new_dentry)) {
+		error = PTR_ERR(new_dentry);
+		goto exit3;
+	}
+	if (!(flags & AT_REPLACE) && d_is_positive(new_dentry)) {
+		error = -EEXIST;
+		goto exit4;
+	}
+	if (new_last.name[new_last.len]) {
+		/* trailing slash on negative dentry gives -ENOENT */
+		if (d_is_negative(new_dentry)) {
+			error = -ENOENT;
+			goto exit4;
+		}
+
+		/*
+		 * unless the source is a directory, trailing slash gives
+		 * -ENOTDIR (this can only happen in the AT_REPLACE case, so we
+		 * make this consistent with sys_renameat2() even though a
+		 * source directory will fail later with -EPERM)
+		 */
+		if (!d_is_dir(old_path.dentry)) {
+			error = -ENOTDIR;
+			goto exit4;
+		}
+	}
 
-	error = -EXDEV;
-	if (old_path.mnt != new_path.mnt)
-		goto out_dput;
 	error = may_linkat(&old_path);
 	if (unlikely(error))
-		goto out_dput;
+		goto exit4;
 	error = security_path_link(old_path.dentry, &new_path, new_dentry);
 	if (error)
-		goto out_dput;
-	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
-out_dput:
-	done_path_create(&new_path, new_dentry);
+		goto exit4;
+	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry,
+			 &delegated_inode, flags & AT_REPLACE);
+exit4:
+	dput(new_dentry);
+exit3:
+	inode_unlock(new_path.dentry->d_inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
-		if (!error) {
-			path_put(&old_path);
-			goto retry;
-		}
+		if (!error)
+			goto retry_deleg;
 	}
-	if (retry_estale(error, how)) {
-		path_put(&old_path);
+	mnt_drop_write(old_path.mnt);
+exit2:
+	if (retry_estale(error, how))
+		should_retry = true;
+	path_put(&new_path);
+	putname(to);
+exit1:
+	path_put(&old_path);
+	if (should_retry) {
+		should_retry = false;
 		how |= LOOKUP_REVAL;
 		goto retry;
 	}
-out:
-	path_put(&old_path);
-
 	return error;
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8ca642f..0a16599 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1594,7 +1594,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 	err = nfserr_noent;
 	if (d_really_is_negative(dold))
 		goto out_dput;
-	host_err = vfs_link(dold, dirp, dnew, NULL);
+	host_err = vfs_link(dold, dirp, dnew, NULL, 0);
 	if (!host_err) {
 		err = nfserrno(commit_metadata(ffhp));
 		if (!err)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index e218e74..71b2f52 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -46,7 +46,7 @@ static inline int ovl_do_unlink(struct inode *dir, struct dentry *dentry)
 static inline int ovl_do_link(struct dentry *old_dentry, struct inode *dir,
 			      struct dentry *new_dentry, bool debug)
 {
-	int err = vfs_link(old_dentry, dir, new_dentry, NULL);
+	int err = vfs_link(old_dentry, dir, new_dentry, NULL, 0);
 	if (debug) {
 		pr_debug("link(%pd2, %pd2) = %i\n",
 			 old_dentry, new_dentry, err);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dc0478c..2efa6c1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1606,7 +1606,7 @@ extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
 extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
 extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
 extern int vfs_symlink(struct inode *, struct dentry *, const char *);
-extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
+extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
@@ -1743,6 +1743,7 @@ struct inode_operations {
 
 	int (*create) (struct inode *,struct dentry *, umode_t, bool);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
+	int (*link2) (struct dentry *,struct inode *,struct dentry *,unsigned int);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct inode *,struct dentry *,const char *);
 	int (*mkdir) (struct inode *,struct dentry *,umode_t);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index beed138..6c1f293 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -62,6 +62,7 @@
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 #define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
 #define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
+#define AT_REPLACE		0x2000	/* Replace new path */
 
 
 #endif /* _UAPI_LINUX_FCNTL_H */
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-22  8:25   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

Changes the inode associated with a dentry. This'll be useful for
implementations of linkat() AT_REPLACE.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/dcache.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/dcache.h |  1 +
 2 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..aaa2b16 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,16 @@ static void dentry_free(struct dentry *dentry)
 		call_rcu(&dentry->d_u.d_rcu, __d_free);
 }
 
+static void dentry_iput(struct dentry *dentry, struct inode *inode)
+{
+	if (!inode->i_nlink)
+		fsnotify_inoderemove(inode);
+	if (dentry->d_op && dentry->d_op->d_iput)
+		dentry->d_op->d_iput(dentry, inode);
+	else
+		iput(inode);
+}
+
 /*
  * Release the dentry's inode, using the filesystem
  * d_iput() operation if defined.
@@ -335,12 +345,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
 		raw_write_seqcount_end(&dentry->d_seq);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&inode->i_lock);
-	if (!inode->i_nlink)
-		fsnotify_inoderemove(inode);
-	if (dentry->d_op && dentry->d_op->d_iput)
-		dentry->d_op->d_iput(dentry, inode);
-	else
-		iput(inode);
+	dentry_iput(dentry, inode);
 }
 
 /*
@@ -1816,6 +1821,24 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 }
 EXPORT_SYMBOL(d_instantiate);
 
+static void lock_two_inodes(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1 > inode2)
+		swap(inode1, inode2);
+	if (inode1)
+		spin_lock(&inode1->i_lock);
+	if (inode2)
+		spin_lock(&inode2->i_lock);
+}
+
+static void unlock_two_inodes(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1)
+		spin_unlock(&inode1->i_lock);
+	if (inode2)
+		spin_unlock(&inode2->i_lock);
+}
+
 /**
  * d_instantiate_no_diralias - instantiate a non-aliased dentry
  * @entry: dentry to complete
@@ -2339,6 +2362,39 @@ void d_delete(struct dentry * dentry)
 }
 EXPORT_SYMBOL(d_delete);
 
+/**
+ * d_replace - change the inode a dentry is associated with
+ * @dentry: dentry to modify
+ * @inode: inode to attach to this dentry
+ *
+ * Fill in new inode information in a dentry that may have previously been
+ * instantiated. This handles both negative and positive dentries.
+ */
+void d_replace(struct dentry *dentry, struct inode *inode)
+{
+	struct inode *old_inode = dentry->d_inode;
+	unsigned int add_flags;
+
+	lock_two_inodes(old_inode, inode);
+	spin_lock(&dentry->d_lock);
+	add_flags = d_flags_for_inode(inode);
+
+	if (old_inode)
+		hlist_del(&dentry->d_u.d_alias);
+	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+
+	raw_write_seqcount_begin(&dentry->d_seq);
+	__d_set_inode_and_type(dentry, inode, add_flags);
+	raw_write_seqcount_end(&dentry->d_seq);
+	fsnotify_update_flags(dentry);
+
+	spin_unlock(&dentry->d_lock);
+	unlock_two_inodes(old_inode, inode);
+	if (old_inode)
+		dentry_iput(dentry, old_inode);
+}
+EXPORT_SYMBOL(d_replace);
+
 static void __d_rehash(struct dentry *entry)
 {
 	struct hlist_bl_head *b = d_hash(entry->d_name.hash);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5beed7b..0610bb0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -224,6 +224,7 @@ extern int d_instantiate_no_diralias(struct dentry *, struct inode *);
 extern void __d_drop(struct dentry *dentry);
 extern void d_drop(struct dentry *dentry);
 extern void d_delete(struct dentry *);
+extern void d_replace(struct dentry *, struct inode *);
 extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op);
 
 /* allocate/de-allocate */
-- 
2.10.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-22  8:25   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

From: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>

Changes the inode associated with a dentry. This'll be useful for
implementations of linkat() AT_REPLACE.

Signed-off-by: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>
---
 fs/dcache.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/dcache.h |  1 +
 2 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..aaa2b16 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,16 @@ static void dentry_free(struct dentry *dentry)
 		call_rcu(&dentry->d_u.d_rcu, __d_free);
 }
 
+static void dentry_iput(struct dentry *dentry, struct inode *inode)
+{
+	if (!inode->i_nlink)
+		fsnotify_inoderemove(inode);
+	if (dentry->d_op && dentry->d_op->d_iput)
+		dentry->d_op->d_iput(dentry, inode);
+	else
+		iput(inode);
+}
+
 /*
  * Release the dentry's inode, using the filesystem
  * d_iput() operation if defined.
@@ -335,12 +345,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
 		raw_write_seqcount_end(&dentry->d_seq);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&inode->i_lock);
-	if (!inode->i_nlink)
-		fsnotify_inoderemove(inode);
-	if (dentry->d_op && dentry->d_op->d_iput)
-		dentry->d_op->d_iput(dentry, inode);
-	else
-		iput(inode);
+	dentry_iput(dentry, inode);
 }
 
 /*
@@ -1816,6 +1821,24 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 }
 EXPORT_SYMBOL(d_instantiate);
 
+static void lock_two_inodes(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1 > inode2)
+		swap(inode1, inode2);
+	if (inode1)
+		spin_lock(&inode1->i_lock);
+	if (inode2)
+		spin_lock(&inode2->i_lock);
+}
+
+static void unlock_two_inodes(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1)
+		spin_unlock(&inode1->i_lock);
+	if (inode2)
+		spin_unlock(&inode2->i_lock);
+}
+
 /**
  * d_instantiate_no_diralias - instantiate a non-aliased dentry
  * @entry: dentry to complete
@@ -2339,6 +2362,39 @@ void d_delete(struct dentry * dentry)
 }
 EXPORT_SYMBOL(d_delete);
 
+/**
+ * d_replace - change the inode a dentry is associated with
+ * @dentry: dentry to modify
+ * @inode: inode to attach to this dentry
+ *
+ * Fill in new inode information in a dentry that may have previously been
+ * instantiated. This handles both negative and positive dentries.
+ */
+void d_replace(struct dentry *dentry, struct inode *inode)
+{
+	struct inode *old_inode = dentry->d_inode;
+	unsigned int add_flags;
+
+	lock_two_inodes(old_inode, inode);
+	spin_lock(&dentry->d_lock);
+	add_flags = d_flags_for_inode(inode);
+
+	if (old_inode)
+		hlist_del(&dentry->d_u.d_alias);
+	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+
+	raw_write_seqcount_begin(&dentry->d_seq);
+	__d_set_inode_and_type(dentry, inode, add_flags);
+	raw_write_seqcount_end(&dentry->d_seq);
+	fsnotify_update_flags(dentry);
+
+	spin_unlock(&dentry->d_lock);
+	unlock_two_inodes(old_inode, inode);
+	if (old_inode)
+		dentry_iput(dentry, old_inode);
+}
+EXPORT_SYMBOL(d_replace);
+
 static void __d_rehash(struct dentry *entry)
 {
 	struct hlist_bl_head *b = d_hash(entry->d_name.hash);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5beed7b..0610bb0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -224,6 +224,7 @@ extern int d_instantiate_no_diralias(struct dentry *, struct inode *);
 extern void __d_drop(struct dentry *dentry);
 extern void d_drop(struct dentry *dentry);
 extern void d_delete(struct dentry *);
+extern void d_replace(struct dentry *, struct inode *);
 extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op);
 
 /* allocate/de-allocate */
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 3/3] Btrfs: add support for linkat() AT_REPLACE
@ 2016-11-22  8:25   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-api, kernel-team

From: Omar Sandoval <osandov@fb.com>

The implementation is fairly straightforward and looks a lot like
btrfs_rename().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e3a5a2..c903f6d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6566,15 +6566,20 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 }
 
 static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
-		      struct dentry *dentry)
+		      struct dentry *dentry, unsigned int flags)
 {
 	struct btrfs_trans_handle *trans = NULL;
+	unsigned int trans_num_items;
 	struct btrfs_root *root = BTRFS_I(dir)->root;
 	struct inode *inode = d_inode(old_dentry);
+	struct inode *new_inode = d_inode(dentry);
 	u64 index;
 	int err;
 	int drop_inode = 0;
 
+	if (flags & ~AT_REPLACE)
+		return -EINVAL;
+
 	/* do not allow sys_link's with other subvols of the same device */
 	if (root->objectid != BTRFS_I(inode)->root->objectid)
 		return -EXDEV;
@@ -6582,16 +6587,47 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	if (inode->i_nlink >= BTRFS_LINK_MAX)
 		return -EMLINK;
 
+	/* check for collisions, even if the  name isn't there */
+	err = btrfs_check_dir_item_collision(root, dir->i_ino,
+					     dentry->d_name.name,
+					     dentry->d_name.len);
+	if (err) {
+		if (err == -EEXIST) {
+			if (WARN_ON(!new_inode))
+				return err;
+		} else {
+			return err;
+		}
+	}
+
+	/*
+	 * we're using link to replace one file with another. Start IO on it now
+	 * so we don't add too much work to the end of the transaction
+	 */
+	if (new_inode && S_ISREG(inode->i_mode) && new_inode->i_size)
+		filemap_flush(inode->i_mapping);
+
 	err = btrfs_set_inode_index(dir, &index);
 	if (err)
 		goto fail;
 
 	/*
+	 * For the source:
 	 * 2 items for inode and inode ref
 	 * 2 items for dir items
 	 * 1 item for parent inode
+	 *
+	 * For the target:
+	 * 1 for the possible orphan item
+	 * 1 for the dir item
+	 * 1 for the dir index
+	 * 1 for the inode ref
+	 * 1 for the inode
 	 */
-	trans = btrfs_start_transaction(root, 5);
+	trans_num_items = 5;
+	if (new_inode)
+		trans_num_items += 5;
+	trans = btrfs_start_transaction(root, trans_num_items);
 	if (IS_ERR(trans)) {
 		err = PTR_ERR(trans);
 		trans = NULL;
@@ -6603,6 +6639,21 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	inc_nlink(inode);
 	inode_inc_iversion(inode);
 	inode->i_ctime = current_time(inode);
+
+	if (new_inode) {
+		inode_inc_iversion(new_inode);
+		new_inode->i_ctime = current_time(new_inode);
+		err = btrfs_unlink_inode(trans, root, dir, new_inode,
+					 dentry->d_name.name,
+					 dentry->d_name.len);
+		if (!err && new_inode->i_nlink == 0)
+			err = btrfs_orphan_add(trans, new_inode);
+		if (err) {
+			btrfs_abort_transaction(trans, err);
+			goto fail;
+		}
+	}
+
 	ihold(inode);
 	set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
 
@@ -6624,7 +6675,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 			if (err)
 				goto fail;
 		}
-		d_instantiate(dentry, inode);
+		d_replace(dentry, inode);
 		btrfs_log_new_name(trans, inode, NULL, parent);
 	}
 
@@ -10570,7 +10621,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.lookup		= btrfs_lookup,
 	.create		= btrfs_create,
 	.unlink		= btrfs_unlink,
-	.link		= btrfs_link,
+	.link2		= btrfs_link,
 	.mkdir		= btrfs_mkdir,
 	.rmdir		= btrfs_rmdir,
 	.rename		= btrfs_rename2,
-- 
2.10.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 3/3] Btrfs: add support for linkat() AT_REPLACE
@ 2016-11-22  8:25   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-22  8:25 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

From: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>

The implementation is fairly straightforward and looks a lot like
btrfs_rename().

Signed-off-by: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>
---
 fs/btrfs/inode.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e3a5a2..c903f6d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6566,15 +6566,20 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 }
 
 static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
-		      struct dentry *dentry)
+		      struct dentry *dentry, unsigned int flags)
 {
 	struct btrfs_trans_handle *trans = NULL;
+	unsigned int trans_num_items;
 	struct btrfs_root *root = BTRFS_I(dir)->root;
 	struct inode *inode = d_inode(old_dentry);
+	struct inode *new_inode = d_inode(dentry);
 	u64 index;
 	int err;
 	int drop_inode = 0;
 
+	if (flags & ~AT_REPLACE)
+		return -EINVAL;
+
 	/* do not allow sys_link's with other subvols of the same device */
 	if (root->objectid != BTRFS_I(inode)->root->objectid)
 		return -EXDEV;
@@ -6582,16 +6587,47 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	if (inode->i_nlink >= BTRFS_LINK_MAX)
 		return -EMLINK;
 
+	/* check for collisions, even if the  name isn't there */
+	err = btrfs_check_dir_item_collision(root, dir->i_ino,
+					     dentry->d_name.name,
+					     dentry->d_name.len);
+	if (err) {
+		if (err == -EEXIST) {
+			if (WARN_ON(!new_inode))
+				return err;
+		} else {
+			return err;
+		}
+	}
+
+	/*
+	 * we're using link to replace one file with another. Start IO on it now
+	 * so we don't add too much work to the end of the transaction
+	 */
+	if (new_inode && S_ISREG(inode->i_mode) && new_inode->i_size)
+		filemap_flush(inode->i_mapping);
+
 	err = btrfs_set_inode_index(dir, &index);
 	if (err)
 		goto fail;
 
 	/*
+	 * For the source:
 	 * 2 items for inode and inode ref
 	 * 2 items for dir items
 	 * 1 item for parent inode
+	 *
+	 * For the target:
+	 * 1 for the possible orphan item
+	 * 1 for the dir item
+	 * 1 for the dir index
+	 * 1 for the inode ref
+	 * 1 for the inode
 	 */
-	trans = btrfs_start_transaction(root, 5);
+	trans_num_items = 5;
+	if (new_inode)
+		trans_num_items += 5;
+	trans = btrfs_start_transaction(root, trans_num_items);
 	if (IS_ERR(trans)) {
 		err = PTR_ERR(trans);
 		trans = NULL;
@@ -6603,6 +6639,21 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	inc_nlink(inode);
 	inode_inc_iversion(inode);
 	inode->i_ctime = current_time(inode);
+
+	if (new_inode) {
+		inode_inc_iversion(new_inode);
+		new_inode->i_ctime = current_time(new_inode);
+		err = btrfs_unlink_inode(trans, root, dir, new_inode,
+					 dentry->d_name.name,
+					 dentry->d_name.len);
+		if (!err && new_inode->i_nlink == 0)
+			err = btrfs_orphan_add(trans, new_inode);
+		if (err) {
+			btrfs_abort_transaction(trans, err);
+			goto fail;
+		}
+	}
+
 	ihold(inode);
 	set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
 
@@ -6624,7 +6675,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 			if (err)
 				goto fail;
 		}
-		d_instantiate(dentry, inode);
+		d_replace(dentry, inode);
 		btrfs_log_new_name(trans, inode, NULL, parent);
 	}
 
@@ -10570,7 +10621,7 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.lookup		= btrfs_lookup,
 	.create		= btrfs_create,
 	.unlink		= btrfs_unlink,
-	.link		= btrfs_link,
+	.link2		= btrfs_link,
 	.mkdir		= btrfs_mkdir,
 	.rmdir		= btrfs_rmdir,
 	.rename		= btrfs_rename2,
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] fs: add AT_REPLACE flag for linkat() which replaces the target
@ 2016-11-22 19:05     ` Colin Walters
  0 siblings, 0 replies; 17+ messages in thread
From: Colin Walters @ 2016-11-22 19:05 UTC (permalink / raw)
  To: Omar Sandoval, linux-fsdevel; +Cc: linux-api, kernel-team, Xi Wang

On Tue, Nov 22, 2016, at 03:25 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> One of the most common uses of temporary files is the classic atomic
> replacement pattern, i.e.,
> 
> - write temporary file
> - fsync temporary file
> - rename temporary file over real file
> - fsync parent directory
> 
> Now, we have O_TMPFILE, which gives us a much better way to create
> temporary files, but it's not possible to use it for this pattern.

FWIW, I have some userspace code that indeed wants exactly this:

https://github.com/GNOME/libglnx/blob/36396b49ad6636c9959f3dfac5e04d41584b1a92/glnx-fdio.c#L232

So if we add this, I'd presumably change the code to try it and
fall back on EINVAL as usual?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/3] fs: add AT_REPLACE flag for linkat() which replaces the target
@ 2016-11-22 19:05     ` Colin Walters
  0 siblings, 0 replies; 17+ messages in thread
From: Colin Walters @ 2016-11-22 19:05 UTC (permalink / raw)
  To: Omar Sandoval, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Xi Wang

On Tue, Nov 22, 2016, at 03:25 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>
> 
> One of the most common uses of temporary files is the classic atomic
> replacement pattern, i.e.,
> 
> - write temporary file
> - fsync temporary file
> - rename temporary file over real file
> - fsync parent directory
> 
> Now, we have O_TMPFILE, which gives us a much better way to create
> temporary files, but it's not possible to use it for this pattern.

FWIW, I have some userspace code that indeed wants exactly this:

https://github.com/GNOME/libglnx/blob/36396b49ad6636c9959f3dfac5e04d41584b1a92/glnx-fdio.c#L232

So if we add this, I'd presumably change the code to try it and
fall back on EINVAL as usual?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-23  3:40     ` Al Viro
  0 siblings, 0 replies; 17+ messages in thread
From: Al Viro @ 2016-11-23  3:40 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-fsdevel, linux-api, kernel-team, Linus Torvalds

On Tue, Nov 22, 2016 at 12:25:02AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Changes the inode associated with a dentry. This'll be useful for
> implementations of linkat() AT_REPLACE.

Hard NAK.  That violates all kinds of assumptions made by VFS and
filesystems alike; never, ever do that.  If you have a reference to
a positive dentry, inode should *NEVER* change.

If it unhashed the old dentry, created a new one and attached inode to
it, it _might_ have a chance.  I'm less than sure it's a good idea, but
it this form it's a non-starter.

Again,

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

and don't bring it back in that form.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-23  3:40     ` Al Viro
  0 siblings, 0 replies; 17+ messages in thread
From: Al Viro @ 2016-11-23  3:40 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
	Linus Torvalds

On Tue, Nov 22, 2016 at 12:25:02AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>
> 
> Changes the inode associated with a dentry. This'll be useful for
> implementations of linkat() AT_REPLACE.

Hard NAK.  That violates all kinds of assumptions made by VFS and
filesystems alike; never, ever do that.  If you have a reference to
a positive dentry, inode should *NEVER* change.

If it unhashed the old dentry, created a new one and attached inode to
it, it _might_ have a chance.  I'm less than sure it's a good idea, but
it this form it's a non-starter.

Again,

NAKed-by: Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

and don't bring it back in that form.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-23  3:57       ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-23  3:57 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-api, kernel-team, Linus Torvalds

On Wed, Nov 23, 2016 at 03:40:04AM +0000, Al Viro wrote:
> On Tue, Nov 22, 2016 at 12:25:02AM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Changes the inode associated with a dentry. This'll be useful for
> > implementations of linkat() AT_REPLACE.
> 
> Hard NAK.  That violates all kinds of assumptions made by VFS and
> filesystems alike; never, ever do that.  If you have a reference to
> a positive dentry, inode should *NEVER* change.

Okay, thanks for the sanity check.

> If it unhashed the old dentry, created a new one and attached inode to
> it, it _might_ have a chance.  I'm less than sure it's a good idea, but
> it this form it's a non-starter.

One thing I considered was having the filesystem unhash the dentry and
just letting the next lookup that comes along instantiate the new one.
Is that better or worse than doing something like your suggestion?

> Again,
> 
> NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> and don't bring it back in that form.

-- 
Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-23  3:57       ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-11-23  3:57 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
	Linus Torvalds

On Wed, Nov 23, 2016 at 03:40:04AM +0000, Al Viro wrote:
> On Tue, Nov 22, 2016 at 12:25:02AM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov-b10kYP2dOMg@public.gmane.org>
> > 
> > Changes the inode associated with a dentry. This'll be useful for
> > implementations of linkat() AT_REPLACE.
> 
> Hard NAK.  That violates all kinds of assumptions made by VFS and
> filesystems alike; never, ever do that.  If you have a reference to
> a positive dentry, inode should *NEVER* change.

Okay, thanks for the sanity check.

> If it unhashed the old dentry, created a new one and attached inode to
> it, it _might_ have a chance.  I'm less than sure it's a good idea, but
> it this form it's a non-starter.

One thing I considered was having the filesystem unhash the dentry and
just letting the next lookup that comes along instantiate the new one.
Is that better or worse than doing something like your suggestion?

> Again,
> 
> NAKed-by: Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> 
> and don't bring it back in that form.

-- 
Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
  2016-11-23  3:57       ` Omar Sandoval
@ 2016-11-23 16:29         ` Linus Torvalds
  -1 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2016-11-23 16:29 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Al Viro, linux-fsdevel, Linux API, kernel-team

On Tue, Nov 22, 2016 at 7:57 PM, Omar Sandoval <osandov@osandov.com> wrote:
> On Wed, Nov 23, 2016 at 03:40:04AM +0000, Al Viro wrote:
>
>> If it unhashed the old dentry, created a new one and attached inode to
>> it, it _might_ have a chance.  I'm less than sure it's a good idea, but
>> it this form it's a non-starter.
>
> One thing I considered was having the filesystem unhash the dentry and
> just letting the next lookup that comes along instantiate the new one.
> Is that better or worse than doing something like your suggestion?

I don't have the background for why you want this, but the two
approaches should be equivalent.

However, it's not a safe operation in the general case, since the
low-level filesystem may depend on the single unique dentry meaning
that operations on one particular filename are serialized and that the
dentry is unique, and your "unhash and create new" would leave old
users with a stale dentry that is no longer "unique" in that filename.
So you certainly cannot do even that kind of "d_replace()" in some
general situation.

An example of that kind of situation is the whole "d_in_lookup()"
where we use the dentry itself to guarantee uniqueness while possibly
looking up multiple entries in parallell in the same directory.

So for some particular filesystem, under some very particular
situations, such a d_replace() may be valid. But without seeing the
background, it's hard to tell. Apparently this was discussed on the
fsdevel list that google doesn't even index, and looking at the
fsdevel archives is a pain. Looks like it's AT_REPLACE for linkat().

In that context it superficially looks ok to me (ie it's
filesystem-controlled and done only when we've serialized the
directory for the link() operation anyway).  But I didn't think about
it _that_ much.

                   Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-11-23 16:29         ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2016-11-23 16:29 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Al Viro, linux-fsdevel, Linux API, kernel-team

On Tue, Nov 22, 2016 at 7:57 PM, Omar Sandoval <osandov-nWWhXC5lh1RBDgjK7y7TUQ@public.gmane.org> wrote:
> On Wed, Nov 23, 2016 at 03:40:04AM +0000, Al Viro wrote:
>
>> If it unhashed the old dentry, created a new one and attached inode to
>> it, it _might_ have a chance.  I'm less than sure it's a good idea, but
>> it this form it's a non-starter.
>
> One thing I considered was having the filesystem unhash the dentry and
> just letting the next lookup that comes along instantiate the new one.
> Is that better or worse than doing something like your suggestion?

I don't have the background for why you want this, but the two
approaches should be equivalent.

However, it's not a safe operation in the general case, since the
low-level filesystem may depend on the single unique dentry meaning
that operations on one particular filename are serialized and that the
dentry is unique, and your "unhash and create new" would leave old
users with a stale dentry that is no longer "unique" in that filename.
So you certainly cannot do even that kind of "d_replace()" in some
general situation.

An example of that kind of situation is the whole "d_in_lookup()"
where we use the dentry itself to guarantee uniqueness while possibly
looking up multiple entries in parallell in the same directory.

So for some particular filesystem, under some very particular
situations, such a d_replace() may be valid. But without seeing the
background, it's hard to tell. Apparently this was discussed on the
fsdevel list that google doesn't even index, and looking at the
fsdevel archives is a pain. Looks like it's AT_REPLACE for linkat().

In that context it superficially looks ok to me (ie it's
filesystem-controlled and done only when we've serialized the
directory for the link() operation anyway).  But I didn't think about
it _that_ much.

                   Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-12-02  1:10           ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-12-02  1:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux API, kernel-team

On Wed, Nov 23, 2016 at 08:29:57AM -0800, Linus Torvalds wrote:
> On Tue, Nov 22, 2016 at 7:57 PM, Omar Sandoval <osandov@osandov.com> wrote:
> > On Wed, Nov 23, 2016 at 03:40:04AM +0000, Al Viro wrote:
> >
> >> If it unhashed the old dentry, created a new one and attached inode to
> >> it, it _might_ have a chance.  I'm less than sure it's a good idea, but
> >> it this form it's a non-starter.
> >
> > One thing I considered was having the filesystem unhash the dentry and
> > just letting the next lookup that comes along instantiate the new one.
> > Is that better or worse than doing something like your suggestion?
> 
> I don't have the background for why you want this, but the two
> approaches should be equivalent.
> 
> However, it's not a safe operation in the general case, since the
> low-level filesystem may depend on the single unique dentry meaning
> that operations on one particular filename are serialized and that the
> dentry is unique, and your "unhash and create new" would leave old
> users with a stale dentry that is no longer "unique" in that filename.
> So you certainly cannot do even that kind of "d_replace()" in some
> general situation.
> 
> An example of that kind of situation is the whole "d_in_lookup()"
> where we use the dentry itself to guarantee uniqueness while possibly
> looking up multiple entries in parallell in the same directory.

Thanks, this was helpful. As you mentioned below, since this is for
linkat(), we're serialized on i_rwsem, so this particular case should be
fine. But if there's something that tries to serialize on the dentry
itself without holding i_rwsem at least shared, then this would
definitely be wrong. Does such a thing exist?

> So for some particular filesystem, under some very particular
> situations, such a d_replace() may be valid. But without seeing the
> background, it's hard to tell. Apparently this was discussed on the
> fsdevel list that google doesn't even index, and looking at the
> fsdevel archives is a pain. Looks like it's AT_REPLACE for linkat().

That's right, here's the thread:

[RFC PATCH 0/3] http://marc.info/?t=147980325700004&r=1&w=2
[RFC PATCH 1/3] http://marc.info/?t=147980325700006&r=1&w=2
[RFC PATCH 2/3] http://marc.info/?t=147980325700005&r=1&w=2
[RFC PATCH 3/3] http://marc.info/?t=147980325700007&r=1&w=2

(Man, I miss gmane.) I'll cc lkml on the next submission.

> In that context it superficially looks ok to me (ie it's
> filesystem-controlled and done only when we've serialized the
> directory for the link() operation anyway).  But I didn't think about
> it _that_ much.
> 
>                    Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3] vfs: add d_replace()
@ 2016-12-02  1:10           ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2016-12-02  1:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux API, kernel-team

On Wed, Nov 23, 2016 at 08:29:57AM -0800, Linus Torvalds wrote:
> On Tue, Nov 22, 2016 at 7:57 PM, Omar Sandoval <osandov-nWWhXC5lh1RBDgjK7y7TUQ@public.gmane.org> wrote:
> > On Wed, Nov 23, 2016 at 03:40:04AM +0000, Al Viro wrote:
> >
> >> If it unhashed the old dentry, created a new one and attached inode to
> >> it, it _might_ have a chance.  I'm less than sure it's a good idea, but
> >> it this form it's a non-starter.
> >
> > One thing I considered was having the filesystem unhash the dentry and
> > just letting the next lookup that comes along instantiate the new one.
> > Is that better or worse than doing something like your suggestion?
> 
> I don't have the background for why you want this, but the two
> approaches should be equivalent.
> 
> However, it's not a safe operation in the general case, since the
> low-level filesystem may depend on the single unique dentry meaning
> that operations on one particular filename are serialized and that the
> dentry is unique, and your "unhash and create new" would leave old
> users with a stale dentry that is no longer "unique" in that filename.
> So you certainly cannot do even that kind of "d_replace()" in some
> general situation.
> 
> An example of that kind of situation is the whole "d_in_lookup()"
> where we use the dentry itself to guarantee uniqueness while possibly
> looking up multiple entries in parallell in the same directory.

Thanks, this was helpful. As you mentioned below, since this is for
linkat(), we're serialized on i_rwsem, so this particular case should be
fine. But if there's something that tries to serialize on the dentry
itself without holding i_rwsem at least shared, then this would
definitely be wrong. Does such a thing exist?

> So for some particular filesystem, under some very particular
> situations, such a d_replace() may be valid. But without seeing the
> background, it's hard to tell. Apparently this was discussed on the
> fsdevel list that google doesn't even index, and looking at the
> fsdevel archives is a pain. Looks like it's AT_REPLACE for linkat().

That's right, here's the thread:

[RFC PATCH 0/3] http://marc.info/?t=147980325700004&r=1&w=2
[RFC PATCH 1/3] http://marc.info/?t=147980325700006&r=1&w=2
[RFC PATCH 2/3] http://marc.info/?t=147980325700005&r=1&w=2
[RFC PATCH 3/3] http://marc.info/?t=147980325700007&r=1&w=2

(Man, I miss gmane.) I'll cc lkml on the next submission.

> In that context it superficially looks ok to me (ie it's
> filesystem-controlled and done only when we've serialized the
> directory for the link() operation anyway).  But I didn't think about
> it _that_ much.
> 
>                    Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-12-02  1:11 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-22  8:25 [RFC PATCH 0/3] fs: add AT_REPLACE flag for linkat() Omar Sandoval
2016-11-22  8:25 ` [RFC PATCH 1/3] fs: add AT_REPLACE flag for linkat() which replaces the target Omar Sandoval
2016-11-22  8:25   ` Omar Sandoval
2016-11-22 19:05   ` Colin Walters
2016-11-22 19:05     ` Colin Walters
2016-11-22  8:25 ` [RFC PATCH 2/3] vfs: add d_replace() Omar Sandoval
2016-11-22  8:25   ` Omar Sandoval
2016-11-23  3:40   ` Al Viro
2016-11-23  3:40     ` Al Viro
2016-11-23  3:57     ` Omar Sandoval
2016-11-23  3:57       ` Omar Sandoval
2016-11-23 16:29       ` Linus Torvalds
2016-11-23 16:29         ` Linus Torvalds
2016-12-02  1:10         ` Omar Sandoval
2016-12-02  1:10           ` Omar Sandoval
2016-11-22  8:25 ` [RFC PATCH 3/3] Btrfs: add support for linkat() AT_REPLACE Omar Sandoval
2016-11-22  8:25   ` Omar Sandoval

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.