linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
@ 2018-05-10 18:20 Al Viro
  2018-05-10 19:11 ` Andreas Dilger
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Al Viro @ 2018-05-10 18:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds

[in the spirit of "don't put 'em in without posting for review; the
this is present in vfs.git#for-linus, if you prefer to look in git.

Background: a bunch of nfsd races fixes from back in 2008 had
problems with lockdep enabled; in 2012 that got "fixed", unfortunately
reopening a narrow race window.  The patch below does *NOT* fix
all filesystems, but it does fix most of the exported local ones
and it is easy to backport, so it makes for a sane starting point.

If anyone has objections, this is your chance to yell.
]

For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
    lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

    Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org   # 2.6.29 and later
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e064c49c9a9a..9e97cbb4f006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6575,8 +6575,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry *dentry,
 		goto out_unlock_inode;
 	} else {
 		btrfs_update_inode(trans, root, inode);
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 	}
 
 out_unlock:
@@ -6652,8 +6651,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 		goto out_unlock_inode;
 
 	BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 
 out_unlock:
 	btrfs_end_transaction(trans);
@@ -6798,12 +6796,7 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (err)
 		goto out_fail_inode;
 
-	d_instantiate(dentry, inode);
-	/*
-	 * mkdir is special.  We're unlocking after we call d_instantiate
-	 * to avoid a race with nfsd calling d_instantiate.
-	 */
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	drop_on_err = 0;
 
 out_fail:
@@ -10246,8 +10239,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
 		goto out_unlock_inode;
 	}
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 
 out_unlock:
 	btrfs_end_transaction(trans);
diff --git a/fs/dcache.c b/fs/dcache.c
index 86d2de63461e..6da095fef440 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1899,6 +1899,22 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 }
 EXPORT_SYMBOL(d_instantiate);
 
+void d_instantiate_new(struct dentry *entry, struct inode *inode)
+{
+	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
+	BUG_ON(!inode);
+	lockdep_annotate_inode_mutex_key(inode);
+	security_d_instantiate(entry, inode);
+	spin_lock(&inode->i_lock);
+	__d_instantiate(entry, inode);
+	WARN_ON(!(inode->i_state & I_NEW));
+	inode->i_state &= ~I_NEW;
+	smp_mb();
+	wake_up_bit(&inode->i_state, __I_NEW);
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL(d_instantiate_new);
+
 /**
  * d_instantiate_no_diralias - instantiate a non-aliased dentry
  * @entry: dentry to complete
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 847904aa63a9..7bba8f2693b2 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -283,8 +283,7 @@ ecryptfs_create(struct inode *directory_inode, struct dentry *ecryptfs_dentry,
 		iget_failed(ecryptfs_inode);
 		goto out;
 	}
-	unlock_new_inode(ecryptfs_inode);
-	d_instantiate(ecryptfs_dentry, ecryptfs_inode);
+	d_instantiate_new(ecryptfs_dentry, ecryptfs_inode);
 out:
 	return rc;
 }
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 55f7caadb093..152453a91877 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -41,8 +41,7 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
 	int err = ext2_add_link(dentry, inode);
 	if (!err) {
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	inode_dec_link_count(inode);
@@ -255,8 +254,7 @@ static int ext2_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
 	if (err)
 		goto out_fail;
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 out:
 	return err;
 
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b1f21e3a0763..4a09063ce1d2 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2411,8 +2411,7 @@ static int ext4_add_nondir(handle_t *handle,
 	int err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
 		ext4_mark_inode_dirty(handle, inode);
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	drop_nlink(inode);
@@ -2651,8 +2650,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	err = ext4_mark_inode_dirty(handle, dir);
 	if (err)
 		goto out_clear_inode;
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	if (IS_DIRSYNC(dir))
 		ext4_handle_sync(handle);
 
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index d5098efe577c..75e37fd720b2 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -294,8 +294,7 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 
 	alloc_nid_done(sbi, ino);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	if (IS_DIRSYNC(dir))
 		f2fs_sync_fs(sbi->sb, 1);
@@ -597,8 +596,7 @@ static int f2fs_symlink(struct inode *dir, struct dentry *dentry,
 	err = page_symlink(inode, disk_link.name, disk_link.len);
 
 err_out:
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	/*
 	 * Let's flush symlink data in order to avoid broken symlink as much as
@@ -661,8 +659,7 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	alloc_nid_done(sbi, inode->i_ino);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	if (IS_DIRSYNC(dir))
 		f2fs_sync_fs(sbi->sb, 1);
@@ -713,8 +710,7 @@ static int f2fs_mknod(struct inode *dir, struct dentry *dentry,
 
 	alloc_nid_done(sbi, inode->i_ino);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	if (IS_DIRSYNC(dir))
 		f2fs_sync_fs(sbi->sb, 1);
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 0a754f38462e..e5a6deb38e1e 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -209,8 +209,7 @@ static int jffs2_create(struct inode *dir_i, struct dentry *dentry,
 		  __func__, inode->i_ino, inode->i_mode, inode->i_nlink,
 		  f->inocache->pino_nlink, inode->i_mapping->nrpages);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
@@ -430,8 +429,7 @@ static int jffs2_symlink (struct inode *dir_i, struct dentry *dentry, const char
 	mutex_unlock(&dir_f->sem);
 	jffs2_complete_reservation(c);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
@@ -575,8 +573,7 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, umode_t mode
 	mutex_unlock(&dir_f->sem);
 	jffs2_complete_reservation(c);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
@@ -747,8 +744,7 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode
 	mutex_unlock(&dir_f->sem);
 	jffs2_complete_reservation(c);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index b41596d71858..56c3fcbfe80e 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -178,8 +178,7 @@ static int jfs_create(struct inode *dip, struct dentry *dentry, umode_t mode,
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out2:
@@ -313,8 +312,7 @@ static int jfs_mkdir(struct inode *dip, struct dentry *dentry, umode_t mode)
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out2:
@@ -1059,8 +1057,7 @@ static int jfs_symlink(struct inode *dip, struct dentry *dentry,
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out2:
@@ -1447,8 +1444,7 @@ static int jfs_mknod(struct inode *dir, struct dentry *dentry,
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out1:
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 1a2894aa0194..dd52d3f82e8d 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -46,8 +46,7 @@ static inline int nilfs_add_nondir(struct dentry *dentry, struct inode *inode)
 	int err = nilfs_add_link(dentry, inode);
 
 	if (!err) {
-		d_instantiate(dentry, inode);
-		unlock_new_inode(inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	inode_dec_link_count(inode);
@@ -243,8 +242,7 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		goto out_fail;
 
 	nilfs_mark_inode_dirty(inode);
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 out:
 	if (!err)
 		err = nilfs_transaction_commit(dir->i_sb);
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index 6e3134e6d98a..1b5707c44c3f 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -75,8 +75,7 @@ static int orangefs_create(struct inode *dir,
 		     get_khandle_from_ino(inode),
 		     dentry);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	orangefs_set_timeout(dentry);
 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
@@ -332,8 +331,7 @@ static int orangefs_symlink(struct inode *dir,
 		     "Assigned symlink inode new number of %pU\n",
 		     get_khandle_from_ino(inode));
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	orangefs_set_timeout(dentry);
 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
@@ -402,8 +400,7 @@ static int orangefs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
 		     "Assigned dir inode new number of %pU\n",
 		     get_khandle_from_ino(inode));
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	orangefs_set_timeout(dentry);
 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index bd39a998843d..5089dac02660 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -687,8 +687,7 @@ static int reiserfs_create(struct inode *dir, struct dentry *dentry, umode_t mod
 	reiserfs_update_inode_transaction(inode);
 	reiserfs_update_inode_transaction(dir);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 
 out_failed:
@@ -771,8 +770,7 @@ static int reiserfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode
 		goto out_failed;
 	}
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 
 out_failed:
@@ -871,8 +869,7 @@ static int reiserfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
 	/* the above add_entry did not update dir's stat data */
 	reiserfs_update_sd(&th, dir);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 out_failed:
 	reiserfs_write_unlock(dir->i_sb);
@@ -1187,8 +1184,7 @@ static int reiserfs_symlink(struct inode *parent_dir,
 		goto out_failed;
 	}
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 out_failed:
 	reiserfs_write_unlock(parent_dir->i_sb);
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 0458dd47e105..c586026508db 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -622,8 +622,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
 	if (fibh.sbh != fibh.ebh)
 		brelse(fibh.ebh);
 	brelse(fibh.sbh);
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 
 	return 0;
 }
@@ -733,8 +732,7 @@ static int udf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	inc_nlink(dir);
 	dir->i_ctime = dir->i_mtime = current_time(dir);
 	mark_inode_dirty(dir);
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	if (fibh.sbh != fibh.ebh)
 		brelse(fibh.ebh);
 	brelse(fibh.sbh);
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 32545cd00ceb..d5f43ba76c59 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -39,8 +39,7 @@ static inline int ufs_add_nondir(struct dentry *dentry, struct inode *inode)
 {
 	int err = ufs_add_link(dentry, inode);
 	if (!err) {
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	inode_dec_link_count(inode);
@@ -193,8 +192,7 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
 	if (err)
 		goto out_fail;
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
 out_fail:
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 94acbde17bb1..66c6e17e61e5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -224,6 +224,7 @@ extern seqlock_t rename_lock;
  * These are the low-level FS interfaces to the dcache..
  */
 extern void d_instantiate(struct dentry *, struct inode *);
+extern void d_instantiate_new(struct dentry *, struct inode *);
 extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
 extern struct dentry * d_instantiate_anon(struct dentry *, struct inode *);
 extern int d_instantiate_no_diralias(struct dentry *, struct inode *);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-10 18:20 [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely Al Viro
@ 2018-05-10 19:11 ` Andreas Dilger
  2018-05-10 19:32   ` Al Viro
  2018-05-10 20:44 ` Mike Marshall
  2018-05-10 22:56 ` Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2018-05-10 19:11 UTC (permalink / raw)
  To: Al Viro; +Cc: fsdevel, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 15733 bytes --]

On May 10, 2018, at 12:20 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> [in the spirit of "don't put 'em in without posting for review; the
> this is present in vfs.git#for-linus, if you prefer to look in git.
> 
> Background: a bunch of nfsd races fixes from back in 2008 had
> problems with lockdep enabled; in 2012 that got "fixed", unfortunately
> reopening a narrow race window.  The patch below does *NOT* fix
> all filesystems, but it does fix most of the exported local ones
> and it is easy to backport, so it makes for a sane starting point.
> 
> If anyone has objections, this is your chance to yell.
> ]
> 
> For anything NFS-exported we do _not_ want to unlock new inode
> before it has grown an alias; original set of fixes got the
> ordering right, but missed the nasty complication in case of
> lockdep being enabled - unlock_new_inode() does
>    lockdep_annotate_inode_mutex_key(inode)
> which can only be done before anyone gets a chance to touch
> ->i_mutex.  Unfortunately, flipping the order and doing
> unlock_new_inode() before d_instantiate() opens a window when
> mkdir can race with open-by-fhandle on a guessed fhandle, leading
> to multiple aliases for a directory inode and all the breakage
> that follows from that.
> 
>    Correct solution: a new primitive (d_instantiate_new())
> combining these two in the right order - lockdep annotate, then
> d_instantiate(), then the rest of unlock_new_inode().  All
> combinations of d_instantiate() with unlock_new_inode() should
> be converted to that.
> 
> Cc: stable@kernel.org   # 2.6.29 and later
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 86d2de63461e..6da095fef440 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1899,6 +1899,22 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
> }
> EXPORT_SYMBOL(d_instantiate);
> 
> +void d_instantiate_new(struct dentry *entry, struct inode *inode)

Is it worthwhile to add a comment here that this is a merger of
d_instantiate() and unlock_inode_new() (and possibly at those
functions as well) so that any future changes are added to both
sets of functions?

Alternately, this could be refactored a bit, but not much:

static void __unlock_new_inode(struct inode *inode)
{
        WARN_ON(!(inode->i_state & I_NEW));
        inode->i_state &= ~I_NEW;
        smp_mb();
        wake_up_bit(&inode->i_state, __I_NEW);
}

void unlock_new_inode(struct inode *inode)
{
        lockdep_annotate_inode_mutex_key(inode);
        spin_lock(&inode->i_lock);
	__unlock_new_inode(inode);
        spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(unlock_new_inode);

void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
	BUG_ON(!inode);
	lockdep_annotate_inode_mutex_key(inode);
	security_d_instantiate(entry, inode);
	spin_lock(&inode->i_lock);
	__d_instantiate(entry, inode);
	__unlock_new_inode(inode);
	spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_instantiate_new);

Cheers, Andreas

> +{
> +	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> +	BUG_ON(!inode);
> +	lockdep_annotate_inode_mutex_key(inode);
> +	security_d_instantiate(entry, inode);
> +	spin_lock(&inode->i_lock);
> +	__d_instantiate(entry, inode);
> +	WARN_ON(!(inode->i_state & I_NEW));
> +	inode->i_state &= ~I_NEW;
> +	smp_mb();
> +	wake_up_bit(&inode->i_state, __I_NEW);
> +	spin_unlock(&inode->i_lock);
> +}
> +EXPORT_SYMBOL(d_instantiate_new);
> +
> /**
>  * d_instantiate_no_diralias - instantiate a non-aliased dentry
>  * @entry: dentry to complete
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index 847904aa63a9..7bba8f2693b2 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -283,8 +283,7 @@ ecryptfs_create(struct inode *directory_inode, struct dentry *ecryptfs_dentry,
> 		iget_failed(ecryptfs_inode);
> 		goto out;
> 	}
> -	unlock_new_inode(ecryptfs_inode);
> -	d_instantiate(ecryptfs_dentry, ecryptfs_inode);
> +	d_instantiate_new(ecryptfs_dentry, ecryptfs_inode);
> out:
> 	return rc;
> }
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index 55f7caadb093..152453a91877 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -41,8 +41,7 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
> {
> 	int err = ext2_add_link(dentry, inode);
> 	if (!err) {
> -		unlock_new_inode(inode);
> -		d_instantiate(dentry, inode);
> +		d_instantiate_new(dentry, inode);
> 		return 0;
> 	}
> 	inode_dec_link_count(inode);
> @@ -255,8 +254,7 @@ static int ext2_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
> 	if (err)
> 		goto out_fail;
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> out:
> 	return err;
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index b1f21e3a0763..4a09063ce1d2 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2411,8 +2411,7 @@ static int ext4_add_nondir(handle_t *handle,
> 	int err = ext4_add_entry(handle, dentry, inode);
> 	if (!err) {
> 		ext4_mark_inode_dirty(handle, inode);
> -		unlock_new_inode(inode);
> -		d_instantiate(dentry, inode);
> +		d_instantiate_new(dentry, inode);
> 		return 0;
> 	}
> 	drop_nlink(inode);
> @@ -2651,8 +2650,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
> 	err = ext4_mark_inode_dirty(handle, dir);
> 	if (err)
> 		goto out_clear_inode;
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	if (IS_DIRSYNC(dir))
> 		ext4_handle_sync(handle);
> 
> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> index d5098efe577c..75e37fd720b2 100644
> --- a/fs/f2fs/namei.c
> +++ b/fs/f2fs/namei.c
> @@ -294,8 +294,7 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
> 
> 	alloc_nid_done(sbi, ino);
> 
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 
> 	if (IS_DIRSYNC(dir))
> 		f2fs_sync_fs(sbi->sb, 1);
> @@ -597,8 +596,7 @@ static int f2fs_symlink(struct inode *dir, struct dentry *dentry,
> 	err = page_symlink(inode, disk_link.name, disk_link.len);
> 
> err_out:
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 
> 	/*
> 	 * Let's flush symlink data in order to avoid broken symlink as much as
> @@ -661,8 +659,7 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
> 
> 	alloc_nid_done(sbi, inode->i_ino);
> 
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 
> 	if (IS_DIRSYNC(dir))
> 		f2fs_sync_fs(sbi->sb, 1);
> @@ -713,8 +710,7 @@ static int f2fs_mknod(struct inode *dir, struct dentry *dentry,
> 
> 	alloc_nid_done(sbi, inode->i_ino);
> 
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 
> 	if (IS_DIRSYNC(dir))
> 		f2fs_sync_fs(sbi->sb, 1);
> diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
> index 0a754f38462e..e5a6deb38e1e 100644
> --- a/fs/jffs2/dir.c
> +++ b/fs/jffs2/dir.c
> @@ -209,8 +209,7 @@ static int jffs2_create(struct inode *dir_i, struct dentry *dentry,
> 		  __func__, inode->i_ino, inode->i_mode, inode->i_nlink,
> 		  f->inocache->pino_nlink, inode->i_mapping->nrpages);
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	return 0;
> 
>  fail:
> @@ -430,8 +429,7 @@ static int jffs2_symlink (struct inode *dir_i, struct dentry *dentry, const char
> 	mutex_unlock(&dir_f->sem);
> 	jffs2_complete_reservation(c);
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	return 0;
> 
>  fail:
> @@ -575,8 +573,7 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, umode_t mode
> 	mutex_unlock(&dir_f->sem);
> 	jffs2_complete_reservation(c);
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	return 0;
> 
>  fail:
> @@ -747,8 +744,7 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode
> 	mutex_unlock(&dir_f->sem);
> 	jffs2_complete_reservation(c);
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	return 0;
> 
>  fail:
> diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
> index b41596d71858..56c3fcbfe80e 100644
> --- a/fs/jfs/namei.c
> +++ b/fs/jfs/namei.c
> @@ -178,8 +178,7 @@ static int jfs_create(struct inode *dip, struct dentry *dentry, umode_t mode,
> 		unlock_new_inode(ip);
> 		iput(ip);
> 	} else {
> -		unlock_new_inode(ip);
> -		d_instantiate(dentry, ip);
> +		d_instantiate_new(dentry, ip);
> 	}
> 
>       out2:
> @@ -313,8 +312,7 @@ static int jfs_mkdir(struct inode *dip, struct dentry *dentry, umode_t mode)
> 		unlock_new_inode(ip);
> 		iput(ip);
> 	} else {
> -		unlock_new_inode(ip);
> -		d_instantiate(dentry, ip);
> +		d_instantiate_new(dentry, ip);
> 	}
> 
>       out2:
> @@ -1059,8 +1057,7 @@ static int jfs_symlink(struct inode *dip, struct dentry *dentry,
> 		unlock_new_inode(ip);
> 		iput(ip);
> 	} else {
> -		unlock_new_inode(ip);
> -		d_instantiate(dentry, ip);
> +		d_instantiate_new(dentry, ip);
> 	}
> 
>       out2:
> @@ -1447,8 +1444,7 @@ static int jfs_mknod(struct inode *dir, struct dentry *dentry,
> 		unlock_new_inode(ip);
> 		iput(ip);
> 	} else {
> -		unlock_new_inode(ip);
> -		d_instantiate(dentry, ip);
> +		d_instantiate_new(dentry, ip);
> 	}
> 
>       out1:
> diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
> index 1a2894aa0194..dd52d3f82e8d 100644
> --- a/fs/nilfs2/namei.c
> +++ b/fs/nilfs2/namei.c
> @@ -46,8 +46,7 @@ static inline int nilfs_add_nondir(struct dentry *dentry, struct inode *inode)
> 	int err = nilfs_add_link(dentry, inode);
> 
> 	if (!err) {
> -		d_instantiate(dentry, inode);
> -		unlock_new_inode(inode);
> +		d_instantiate_new(dentry, inode);
> 		return 0;
> 	}
> 	inode_dec_link_count(inode);
> @@ -243,8 +242,7 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
> 		goto out_fail;
> 
> 	nilfs_mark_inode_dirty(inode);
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> out:
> 	if (!err)
> 		err = nilfs_transaction_commit(dir->i_sb);
> diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
> index 6e3134e6d98a..1b5707c44c3f 100644
> --- a/fs/orangefs/namei.c
> +++ b/fs/orangefs/namei.c
> @@ -75,8 +75,7 @@ static int orangefs_create(struct inode *dir,
> 		     get_khandle_from_ino(inode),
> 		     dentry);
> 
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 	orangefs_set_timeout(dentry);
> 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
> 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
> @@ -332,8 +331,7 @@ static int orangefs_symlink(struct inode *dir,
> 		     "Assigned symlink inode new number of %pU\n",
> 		     get_khandle_from_ino(inode));
> 
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 	orangefs_set_timeout(dentry);
> 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
> 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
> @@ -402,8 +400,7 @@ static int orangefs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
> 		     "Assigned dir inode new number of %pU\n",
> 		     get_khandle_from_ino(inode));
> 
> -	d_instantiate(dentry, inode);
> -	unlock_new_inode(inode);
> +	d_instantiate_new(dentry, inode);
> 	orangefs_set_timeout(dentry);
> 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
> 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
> diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
> index bd39a998843d..5089dac02660 100644
> --- a/fs/reiserfs/namei.c
> +++ b/fs/reiserfs/namei.c
> @@ -687,8 +687,7 @@ static int reiserfs_create(struct inode *dir, struct dentry *dentry, umode_t mod
> 	reiserfs_update_inode_transaction(inode);
> 	reiserfs_update_inode_transaction(dir);
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	retval = journal_end(&th);
> 
> out_failed:
> @@ -771,8 +770,7 @@ static int reiserfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode
> 		goto out_failed;
> 	}
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	retval = journal_end(&th);
> 
> out_failed:
> @@ -871,8 +869,7 @@ static int reiserfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
> 	/* the above add_entry did not update dir's stat data */
> 	reiserfs_update_sd(&th, dir);
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	retval = journal_end(&th);
> out_failed:
> 	reiserfs_write_unlock(dir->i_sb);
> @@ -1187,8 +1184,7 @@ static int reiserfs_symlink(struct inode *parent_dir,
> 		goto out_failed;
> 	}
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	retval = journal_end(&th);
> out_failed:
> 	reiserfs_write_unlock(parent_dir->i_sb);
> diff --git a/fs/udf/namei.c b/fs/udf/namei.c
> index 0458dd47e105..c586026508db 100644
> --- a/fs/udf/namei.c
> +++ b/fs/udf/namei.c
> @@ -622,8 +622,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
> 	if (fibh.sbh != fibh.ebh)
> 		brelse(fibh.ebh);
> 	brelse(fibh.sbh);
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 
> 	return 0;
> }
> @@ -733,8 +732,7 @@ static int udf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
> 	inc_nlink(dir);
> 	dir->i_ctime = dir->i_mtime = current_time(dir);
> 	mark_inode_dirty(dir);
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	if (fibh.sbh != fibh.ebh)
> 		brelse(fibh.ebh);
> 	brelse(fibh.sbh);
> diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
> index 32545cd00ceb..d5f43ba76c59 100644
> --- a/fs/ufs/namei.c
> +++ b/fs/ufs/namei.c
> @@ -39,8 +39,7 @@ static inline int ufs_add_nondir(struct dentry *dentry, struct inode *inode)
> {
> 	int err = ufs_add_link(dentry, inode);
> 	if (!err) {
> -		unlock_new_inode(inode);
> -		d_instantiate(dentry, inode);
> +		d_instantiate_new(dentry, inode);
> 		return 0;
> 	}
> 	inode_dec_link_count(inode);
> @@ -193,8 +192,7 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
> 	if (err)
> 		goto out_fail;
> 
> -	unlock_new_inode(inode);
> -	d_instantiate(dentry, inode);
> +	d_instantiate_new(dentry, inode);
> 	return 0;
> 
> out_fail:
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 94acbde17bb1..66c6e17e61e5 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -224,6 +224,7 @@ extern seqlock_t rename_lock;
>  * These are the low-level FS interfaces to the dcache..
>  */
> extern void d_instantiate(struct dentry *, struct inode *);
> +extern void d_instantiate_new(struct dentry *, struct inode *);
> extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
> extern struct dentry * d_instantiate_anon(struct dentry *, struct inode *);
> extern int d_instantiate_no_diralias(struct dentry *, struct inode *);


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-10 19:11 ` Andreas Dilger
@ 2018-05-10 19:32   ` Al Viro
  0 siblings, 0 replies; 11+ messages in thread
From: Al Viro @ 2018-05-10 19:32 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: fsdevel, Linus Torvalds

On Thu, May 10, 2018 at 01:11:00PM -0600, Andreas Dilger wrote:

> > +void d_instantiate_new(struct dentry *entry, struct inode *inode)
> 
> Is it worthwhile to add a comment here that this is a merger of
> d_instantiate() and unlock_inode_new() (and possibly at those
> functions as well) so that any future changes are added to both
> sets of functions?

FWIW, I hope to reduce the number of d_instantiate() callers.
As it is, it's a serious headache - any use of d_instantiate()
needs to be reviewed for correctness, and the more I'm looking
into it (this is far from the only problem), the more it feels
like a case of bad calling conventions...

Anyway, comment re "this should be equivalent to d_instantiate()+
unlock_new_inode(), with lockdep bits of the latter done before
anything else" makes sense for backports.

> Alternately, this could be refactored a bit, but not much:
> 
> static void __unlock_new_inode(struct inode *inode)
> {
>         WARN_ON(!(inode->i_state & I_NEW));
>         inode->i_state &= ~I_NEW;
>         smp_mb();
>         wake_up_bit(&inode->i_state, __I_NEW);
> }
> 
> void unlock_new_inode(struct inode *inode)
> {
>         lockdep_annotate_inode_mutex_key(inode);
>         spin_lock(&inode->i_lock);
> 	__unlock_new_inode(inode);
>         spin_unlock(&inode->i_lock);
> }
> EXPORT_SYMBOL(unlock_new_inode);
> 
> void d_instantiate_new(struct dentry *entry, struct inode *inode)
> {
> 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> 	BUG_ON(!inode);
> 	lockdep_annotate_inode_mutex_key(inode);
> 	security_d_instantiate(entry, inode);
> 	spin_lock(&inode->i_lock);
> 	__d_instantiate(entry, inode);
> 	__unlock_new_inode(inode);
> 	spin_unlock(&inode->i_lock);
> }
> EXPORT_SYMBOL(d_instantiate_new);

I thought of that, but then we'd have to make __d_instantiate() seen
outside of fs/dcache.c ;-/

Anyway, update variant of the patch follows:

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e064c49c9a9a..9e97cbb4f006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6575,8 +6575,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry *dentry,
 		goto out_unlock_inode;
 	} else {
 		btrfs_update_inode(trans, root, inode);
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 	}
 
 out_unlock:
@@ -6652,8 +6651,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 		goto out_unlock_inode;
 
 	BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 
 out_unlock:
 	btrfs_end_transaction(trans);
@@ -6798,12 +6796,7 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (err)
 		goto out_fail_inode;
 
-	d_instantiate(dentry, inode);
-	/*
-	 * mkdir is special.  We're unlocking after we call d_instantiate
-	 * to avoid a race with nfsd calling d_instantiate.
-	 */
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	drop_on_err = 0;
 
 out_fail:
@@ -10246,8 +10239,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
 		goto out_unlock_inode;
 	}
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 
 out_unlock:
 	btrfs_end_transaction(trans);
diff --git a/fs/dcache.c b/fs/dcache.c
index 86d2de63461e..2acfc69878f5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1899,6 +1899,28 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 }
 EXPORT_SYMBOL(d_instantiate);
 
+/*
+ * This should be equivalent to d_instantiate() + unlock_new_inode(),
+ * with lockdep-related part of unlock_new_inode() done before
+ * anything else.  Use that instead of open-coding d_instantiate()/
+ * unlock_new_inode() combinations.
+ */
+void d_instantiate_new(struct dentry *entry, struct inode *inode)
+{
+	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
+	BUG_ON(!inode);
+	lockdep_annotate_inode_mutex_key(inode);
+	security_d_instantiate(entry, inode);
+	spin_lock(&inode->i_lock);
+	__d_instantiate(entry, inode);
+	WARN_ON(!(inode->i_state & I_NEW));
+	inode->i_state &= ~I_NEW;
+	smp_mb();
+	wake_up_bit(&inode->i_state, __I_NEW);
+	spin_unlock(&inode->i_lock);
+}
+EXPORT_SYMBOL(d_instantiate_new);
+
 /**
  * d_instantiate_no_diralias - instantiate a non-aliased dentry
  * @entry: dentry to complete
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 847904aa63a9..7bba8f2693b2 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -283,8 +283,7 @@ ecryptfs_create(struct inode *directory_inode, struct dentry *ecryptfs_dentry,
 		iget_failed(ecryptfs_inode);
 		goto out;
 	}
-	unlock_new_inode(ecryptfs_inode);
-	d_instantiate(ecryptfs_dentry, ecryptfs_inode);
+	d_instantiate_new(ecryptfs_dentry, ecryptfs_inode);
 out:
 	return rc;
 }
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 55f7caadb093..152453a91877 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -41,8 +41,7 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
 	int err = ext2_add_link(dentry, inode);
 	if (!err) {
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	inode_dec_link_count(inode);
@@ -255,8 +254,7 @@ static int ext2_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
 	if (err)
 		goto out_fail;
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 out:
 	return err;
 
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b1f21e3a0763..4a09063ce1d2 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2411,8 +2411,7 @@ static int ext4_add_nondir(handle_t *handle,
 	int err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
 		ext4_mark_inode_dirty(handle, inode);
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	drop_nlink(inode);
@@ -2651,8 +2650,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	err = ext4_mark_inode_dirty(handle, dir);
 	if (err)
 		goto out_clear_inode;
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	if (IS_DIRSYNC(dir))
 		ext4_handle_sync(handle);
 
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index d5098efe577c..75e37fd720b2 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -294,8 +294,7 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 
 	alloc_nid_done(sbi, ino);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	if (IS_DIRSYNC(dir))
 		f2fs_sync_fs(sbi->sb, 1);
@@ -597,8 +596,7 @@ static int f2fs_symlink(struct inode *dir, struct dentry *dentry,
 	err = page_symlink(inode, disk_link.name, disk_link.len);
 
 err_out:
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	/*
 	 * Let's flush symlink data in order to avoid broken symlink as much as
@@ -661,8 +659,7 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	alloc_nid_done(sbi, inode->i_ino);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	if (IS_DIRSYNC(dir))
 		f2fs_sync_fs(sbi->sb, 1);
@@ -713,8 +710,7 @@ static int f2fs_mknod(struct inode *dir, struct dentry *dentry,
 
 	alloc_nid_done(sbi, inode->i_ino);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 
 	if (IS_DIRSYNC(dir))
 		f2fs_sync_fs(sbi->sb, 1);
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 0a754f38462e..e5a6deb38e1e 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -209,8 +209,7 @@ static int jffs2_create(struct inode *dir_i, struct dentry *dentry,
 		  __func__, inode->i_ino, inode->i_mode, inode->i_nlink,
 		  f->inocache->pino_nlink, inode->i_mapping->nrpages);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
@@ -430,8 +429,7 @@ static int jffs2_symlink (struct inode *dir_i, struct dentry *dentry, const char
 	mutex_unlock(&dir_f->sem);
 	jffs2_complete_reservation(c);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
@@ -575,8 +573,7 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, umode_t mode
 	mutex_unlock(&dir_f->sem);
 	jffs2_complete_reservation(c);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
@@ -747,8 +744,7 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode
 	mutex_unlock(&dir_f->sem);
 	jffs2_complete_reservation(c);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
  fail:
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index b41596d71858..56c3fcbfe80e 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -178,8 +178,7 @@ static int jfs_create(struct inode *dip, struct dentry *dentry, umode_t mode,
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out2:
@@ -313,8 +312,7 @@ static int jfs_mkdir(struct inode *dip, struct dentry *dentry, umode_t mode)
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out2:
@@ -1059,8 +1057,7 @@ static int jfs_symlink(struct inode *dip, struct dentry *dentry,
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out2:
@@ -1447,8 +1444,7 @@ static int jfs_mknod(struct inode *dir, struct dentry *dentry,
 		unlock_new_inode(ip);
 		iput(ip);
 	} else {
-		unlock_new_inode(ip);
-		d_instantiate(dentry, ip);
+		d_instantiate_new(dentry, ip);
 	}
 
       out1:
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 1a2894aa0194..dd52d3f82e8d 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -46,8 +46,7 @@ static inline int nilfs_add_nondir(struct dentry *dentry, struct inode *inode)
 	int err = nilfs_add_link(dentry, inode);
 
 	if (!err) {
-		d_instantiate(dentry, inode);
-		unlock_new_inode(inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	inode_dec_link_count(inode);
@@ -243,8 +242,7 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		goto out_fail;
 
 	nilfs_mark_inode_dirty(inode);
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 out:
 	if (!err)
 		err = nilfs_transaction_commit(dir->i_sb);
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index 6e3134e6d98a..1b5707c44c3f 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -75,8 +75,7 @@ static int orangefs_create(struct inode *dir,
 		     get_khandle_from_ino(inode),
 		     dentry);
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	orangefs_set_timeout(dentry);
 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
@@ -332,8 +331,7 @@ static int orangefs_symlink(struct inode *dir,
 		     "Assigned symlink inode new number of %pU\n",
 		     get_khandle_from_ino(inode));
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	orangefs_set_timeout(dentry);
 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
@@ -402,8 +400,7 @@ static int orangefs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
 		     "Assigned dir inode new number of %pU\n",
 		     get_khandle_from_ino(inode));
 
-	d_instantiate(dentry, inode);
-	unlock_new_inode(inode);
+	d_instantiate_new(dentry, inode);
 	orangefs_set_timeout(dentry);
 	ORANGEFS_I(inode)->getattr_time = jiffies - 1;
 	ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index bd39a998843d..5089dac02660 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -687,8 +687,7 @@ static int reiserfs_create(struct inode *dir, struct dentry *dentry, umode_t mod
 	reiserfs_update_inode_transaction(inode);
 	reiserfs_update_inode_transaction(dir);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 
 out_failed:
@@ -771,8 +770,7 @@ static int reiserfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode
 		goto out_failed;
 	}
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 
 out_failed:
@@ -871,8 +869,7 @@ static int reiserfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
 	/* the above add_entry did not update dir's stat data */
 	reiserfs_update_sd(&th, dir);
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 out_failed:
 	reiserfs_write_unlock(dir->i_sb);
@@ -1187,8 +1184,7 @@ static int reiserfs_symlink(struct inode *parent_dir,
 		goto out_failed;
 	}
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	retval = journal_end(&th);
 out_failed:
 	reiserfs_write_unlock(parent_dir->i_sb);
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 0458dd47e105..c586026508db 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -622,8 +622,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
 	if (fibh.sbh != fibh.ebh)
 		brelse(fibh.ebh);
 	brelse(fibh.sbh);
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 
 	return 0;
 }
@@ -733,8 +732,7 @@ static int udf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	inc_nlink(dir);
 	dir->i_ctime = dir->i_mtime = current_time(dir);
 	mark_inode_dirty(dir);
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	if (fibh.sbh != fibh.ebh)
 		brelse(fibh.ebh);
 	brelse(fibh.sbh);
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 32545cd00ceb..d5f43ba76c59 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -39,8 +39,7 @@ static inline int ufs_add_nondir(struct dentry *dentry, struct inode *inode)
 {
 	int err = ufs_add_link(dentry, inode);
 	if (!err) {
-		unlock_new_inode(inode);
-		d_instantiate(dentry, inode);
+		d_instantiate_new(dentry, inode);
 		return 0;
 	}
 	inode_dec_link_count(inode);
@@ -193,8 +192,7 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
 	if (err)
 		goto out_fail;
 
-	unlock_new_inode(inode);
-	d_instantiate(dentry, inode);
+	d_instantiate_new(dentry, inode);
 	return 0;
 
 out_fail:
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 94acbde17bb1..66c6e17e61e5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -224,6 +224,7 @@ extern seqlock_t rename_lock;
  * These are the low-level FS interfaces to the dcache..
  */
 extern void d_instantiate(struct dentry *, struct inode *);
+extern void d_instantiate_new(struct dentry *, struct inode *);
 extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
 extern struct dentry * d_instantiate_anon(struct dentry *, struct inode *);
 extern int d_instantiate_no_diralias(struct dentry *, struct inode *);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-10 18:20 [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely Al Viro
  2018-05-10 19:11 ` Andreas Dilger
@ 2018-05-10 20:44 ` Mike Marshall
  2018-05-10 22:56 ` Dave Chinner
  2 siblings, 0 replies; 11+ messages in thread
From: Mike Marshall @ 2018-05-10 20:44 UTC (permalink / raw)
  To: Al Viro, Mike Marshall, Martin Brandenburg; +Cc: linux-fsdevel, Linus Torvalds

I applied your patch to Linux v4.17-rc3 and ran xfstests and saw
no Orangefs regressions.... you can add tested-by: Mike Marshall
if you dare <g> ...

-Mike

On Thu, May 10, 2018 at 2:20 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> [in the spirit of "don't put 'em in without posting for review; the
> this is present in vfs.git#for-linus, if you prefer to look in git.
>
> Background: a bunch of nfsd races fixes from back in 2008 had
> problems with lockdep enabled; in 2012 that got "fixed", unfortunately
> reopening a narrow race window.  The patch below does *NOT* fix
> all filesystems, but it does fix most of the exported local ones
> and it is easy to backport, so it makes for a sane starting point.
>
> If anyone has objections, this is your chance to yell.
> ]
>
> For anything NFS-exported we do _not_ want to unlock new inode
> before it has grown an alias; original set of fixes got the
> ordering right, but missed the nasty complication in case of
> lockdep being enabled - unlock_new_inode() does
>     lockdep_annotate_inode_mutex_key(inode)
> which can only be done before anyone gets a chance to touch
> ->i_mutex.  Unfortunately, flipping the order and doing
> unlock_new_inode() before d_instantiate() opens a window when
> mkdir can race with open-by-fhandle on a guessed fhandle, leading
> to multiple aliases for a directory inode and all the breakage
> that follows from that.
>
>     Correct solution: a new primitive (d_instantiate_new())
> combining these two in the right order - lockdep annotate, then
> d_instantiate(), then the rest of unlock_new_inode().  All
> combinations of d_instantiate() with unlock_new_inode() should
> be converted to that.
>
> Cc: stable@kernel.org   # 2.6.29 and later
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e064c49c9a9a..9e97cbb4f006 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6575,8 +6575,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry *dentry,
>                 goto out_unlock_inode;
>         } else {
>                 btrfs_update_inode(trans, root, inode);
> -               unlock_new_inode(inode);
> -               d_instantiate(dentry, inode);
> +               d_instantiate_new(dentry, inode);
>         }
>
>  out_unlock:
> @@ -6652,8 +6651,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
>                 goto out_unlock_inode;
>
>         BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>
>  out_unlock:
>         btrfs_end_transaction(trans);
> @@ -6798,12 +6796,7 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>         if (err)
>                 goto out_fail_inode;
>
> -       d_instantiate(dentry, inode);
> -       /*
> -        * mkdir is special.  We're unlocking after we call d_instantiate
> -        * to avoid a race with nfsd calling d_instantiate.
> -        */
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>         drop_on_err = 0;
>
>  out_fail:
> @@ -10246,8 +10239,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
>                 goto out_unlock_inode;
>         }
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>
>  out_unlock:
>         btrfs_end_transaction(trans);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 86d2de63461e..6da095fef440 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1899,6 +1899,22 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
>  }
>  EXPORT_SYMBOL(d_instantiate);
>
> +void d_instantiate_new(struct dentry *entry, struct inode *inode)
> +{
> +       BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> +       BUG_ON(!inode);
> +       lockdep_annotate_inode_mutex_key(inode);
> +       security_d_instantiate(entry, inode);
> +       spin_lock(&inode->i_lock);
> +       __d_instantiate(entry, inode);
> +       WARN_ON(!(inode->i_state & I_NEW));
> +       inode->i_state &= ~I_NEW;
> +       smp_mb();
> +       wake_up_bit(&inode->i_state, __I_NEW);
> +       spin_unlock(&inode->i_lock);
> +}
> +EXPORT_SYMBOL(d_instantiate_new);
> +
>  /**
>   * d_instantiate_no_diralias - instantiate a non-aliased dentry
>   * @entry: dentry to complete
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index 847904aa63a9..7bba8f2693b2 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -283,8 +283,7 @@ ecryptfs_create(struct inode *directory_inode, struct dentry *ecryptfs_dentry,
>                 iget_failed(ecryptfs_inode);
>                 goto out;
>         }
> -       unlock_new_inode(ecryptfs_inode);
> -       d_instantiate(ecryptfs_dentry, ecryptfs_inode);
> +       d_instantiate_new(ecryptfs_dentry, ecryptfs_inode);
>  out:
>         return rc;
>  }
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index 55f7caadb093..152453a91877 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -41,8 +41,7 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
>  {
>         int err = ext2_add_link(dentry, inode);
>         if (!err) {
> -               unlock_new_inode(inode);
> -               d_instantiate(dentry, inode);
> +               d_instantiate_new(dentry, inode);
>                 return 0;
>         }
>         inode_dec_link_count(inode);
> @@ -255,8 +254,7 @@ static int ext2_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
>         if (err)
>                 goto out_fail;
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>  out:
>         return err;
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index b1f21e3a0763..4a09063ce1d2 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2411,8 +2411,7 @@ static int ext4_add_nondir(handle_t *handle,
>         int err = ext4_add_entry(handle, dentry, inode);
>         if (!err) {
>                 ext4_mark_inode_dirty(handle, inode);
> -               unlock_new_inode(inode);
> -               d_instantiate(dentry, inode);
> +               d_instantiate_new(dentry, inode);
>                 return 0;
>         }
>         drop_nlink(inode);
> @@ -2651,8 +2650,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>         err = ext4_mark_inode_dirty(handle, dir);
>         if (err)
>                 goto out_clear_inode;
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         if (IS_DIRSYNC(dir))
>                 ext4_handle_sync(handle);
>
> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> index d5098efe577c..75e37fd720b2 100644
> --- a/fs/f2fs/namei.c
> +++ b/fs/f2fs/namei.c
> @@ -294,8 +294,7 @@ static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
>
>         alloc_nid_done(sbi, ino);
>
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>
>         if (IS_DIRSYNC(dir))
>                 f2fs_sync_fs(sbi->sb, 1);
> @@ -597,8 +596,7 @@ static int f2fs_symlink(struct inode *dir, struct dentry *dentry,
>         err = page_symlink(inode, disk_link.name, disk_link.len);
>
>  err_out:
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>
>         /*
>          * Let's flush symlink data in order to avoid broken symlink as much as
> @@ -661,8 +659,7 @@ static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>
>         alloc_nid_done(sbi, inode->i_ino);
>
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>
>         if (IS_DIRSYNC(dir))
>                 f2fs_sync_fs(sbi->sb, 1);
> @@ -713,8 +710,7 @@ static int f2fs_mknod(struct inode *dir, struct dentry *dentry,
>
>         alloc_nid_done(sbi, inode->i_ino);
>
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>
>         if (IS_DIRSYNC(dir))
>                 f2fs_sync_fs(sbi->sb, 1);
> diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
> index 0a754f38462e..e5a6deb38e1e 100644
> --- a/fs/jffs2/dir.c
> +++ b/fs/jffs2/dir.c
> @@ -209,8 +209,7 @@ static int jffs2_create(struct inode *dir_i, struct dentry *dentry,
>                   __func__, inode->i_ino, inode->i_mode, inode->i_nlink,
>                   f->inocache->pino_nlink, inode->i_mapping->nrpages);
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         return 0;
>
>   fail:
> @@ -430,8 +429,7 @@ static int jffs2_symlink (struct inode *dir_i, struct dentry *dentry, const char
>         mutex_unlock(&dir_f->sem);
>         jffs2_complete_reservation(c);
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         return 0;
>
>   fail:
> @@ -575,8 +573,7 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, umode_t mode
>         mutex_unlock(&dir_f->sem);
>         jffs2_complete_reservation(c);
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         return 0;
>
>   fail:
> @@ -747,8 +744,7 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode
>         mutex_unlock(&dir_f->sem);
>         jffs2_complete_reservation(c);
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         return 0;
>
>   fail:
> diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
> index b41596d71858..56c3fcbfe80e 100644
> --- a/fs/jfs/namei.c
> +++ b/fs/jfs/namei.c
> @@ -178,8 +178,7 @@ static int jfs_create(struct inode *dip, struct dentry *dentry, umode_t mode,
>                 unlock_new_inode(ip);
>                 iput(ip);
>         } else {
> -               unlock_new_inode(ip);
> -               d_instantiate(dentry, ip);
> +               d_instantiate_new(dentry, ip);
>         }
>
>        out2:
> @@ -313,8 +312,7 @@ static int jfs_mkdir(struct inode *dip, struct dentry *dentry, umode_t mode)
>                 unlock_new_inode(ip);
>                 iput(ip);
>         } else {
> -               unlock_new_inode(ip);
> -               d_instantiate(dentry, ip);
> +               d_instantiate_new(dentry, ip);
>         }
>
>        out2:
> @@ -1059,8 +1057,7 @@ static int jfs_symlink(struct inode *dip, struct dentry *dentry,
>                 unlock_new_inode(ip);
>                 iput(ip);
>         } else {
> -               unlock_new_inode(ip);
> -               d_instantiate(dentry, ip);
> +               d_instantiate_new(dentry, ip);
>         }
>
>        out2:
> @@ -1447,8 +1444,7 @@ static int jfs_mknod(struct inode *dir, struct dentry *dentry,
>                 unlock_new_inode(ip);
>                 iput(ip);
>         } else {
> -               unlock_new_inode(ip);
> -               d_instantiate(dentry, ip);
> +               d_instantiate_new(dentry, ip);
>         }
>
>        out1:
> diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
> index 1a2894aa0194..dd52d3f82e8d 100644
> --- a/fs/nilfs2/namei.c
> +++ b/fs/nilfs2/namei.c
> @@ -46,8 +46,7 @@ static inline int nilfs_add_nondir(struct dentry *dentry, struct inode *inode)
>         int err = nilfs_add_link(dentry, inode);
>
>         if (!err) {
> -               d_instantiate(dentry, inode);
> -               unlock_new_inode(inode);
> +               d_instantiate_new(dentry, inode);
>                 return 0;
>         }
>         inode_dec_link_count(inode);
> @@ -243,8 +242,7 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>                 goto out_fail;
>
>         nilfs_mark_inode_dirty(inode);
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>  out:
>         if (!err)
>                 err = nilfs_transaction_commit(dir->i_sb);
> diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
> index 6e3134e6d98a..1b5707c44c3f 100644
> --- a/fs/orangefs/namei.c
> +++ b/fs/orangefs/namei.c
> @@ -75,8 +75,7 @@ static int orangefs_create(struct inode *dir,
>                      get_khandle_from_ino(inode),
>                      dentry);
>
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>         orangefs_set_timeout(dentry);
>         ORANGEFS_I(inode)->getattr_time = jiffies - 1;
>         ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
> @@ -332,8 +331,7 @@ static int orangefs_symlink(struct inode *dir,
>                      "Assigned symlink inode new number of %pU\n",
>                      get_khandle_from_ino(inode));
>
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>         orangefs_set_timeout(dentry);
>         ORANGEFS_I(inode)->getattr_time = jiffies - 1;
>         ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
> @@ -402,8 +400,7 @@ static int orangefs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
>                      "Assigned dir inode new number of %pU\n",
>                      get_khandle_from_ino(inode));
>
> -       d_instantiate(dentry, inode);
> -       unlock_new_inode(inode);
> +       d_instantiate_new(dentry, inode);
>         orangefs_set_timeout(dentry);
>         ORANGEFS_I(inode)->getattr_time = jiffies - 1;
>         ORANGEFS_I(inode)->getattr_mask = STATX_BASIC_STATS;
> diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
> index bd39a998843d..5089dac02660 100644
> --- a/fs/reiserfs/namei.c
> +++ b/fs/reiserfs/namei.c
> @@ -687,8 +687,7 @@ static int reiserfs_create(struct inode *dir, struct dentry *dentry, umode_t mod
>         reiserfs_update_inode_transaction(inode);
>         reiserfs_update_inode_transaction(dir);
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         retval = journal_end(&th);
>
>  out_failed:
> @@ -771,8 +770,7 @@ static int reiserfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode
>                 goto out_failed;
>         }
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         retval = journal_end(&th);
>
>  out_failed:
> @@ -871,8 +869,7 @@ static int reiserfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
>         /* the above add_entry did not update dir's stat data */
>         reiserfs_update_sd(&th, dir);
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         retval = journal_end(&th);
>  out_failed:
>         reiserfs_write_unlock(dir->i_sb);
> @@ -1187,8 +1184,7 @@ static int reiserfs_symlink(struct inode *parent_dir,
>                 goto out_failed;
>         }
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         retval = journal_end(&th);
>  out_failed:
>         reiserfs_write_unlock(parent_dir->i_sb);
> diff --git a/fs/udf/namei.c b/fs/udf/namei.c
> index 0458dd47e105..c586026508db 100644
> --- a/fs/udf/namei.c
> +++ b/fs/udf/namei.c
> @@ -622,8 +622,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
>         if (fibh.sbh != fibh.ebh)
>                 brelse(fibh.ebh);
>         brelse(fibh.sbh);
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>
>         return 0;
>  }
> @@ -733,8 +732,7 @@ static int udf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>         inc_nlink(dir);
>         dir->i_ctime = dir->i_mtime = current_time(dir);
>         mark_inode_dirty(dir);
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         if (fibh.sbh != fibh.ebh)
>                 brelse(fibh.ebh);
>         brelse(fibh.sbh);
> diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
> index 32545cd00ceb..d5f43ba76c59 100644
> --- a/fs/ufs/namei.c
> +++ b/fs/ufs/namei.c
> @@ -39,8 +39,7 @@ static inline int ufs_add_nondir(struct dentry *dentry, struct inode *inode)
>  {
>         int err = ufs_add_link(dentry, inode);
>         if (!err) {
> -               unlock_new_inode(inode);
> -               d_instantiate(dentry, inode);
> +               d_instantiate_new(dentry, inode);
>                 return 0;
>         }
>         inode_dec_link_count(inode);
> @@ -193,8 +192,7 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
>         if (err)
>                 goto out_fail;
>
> -       unlock_new_inode(inode);
> -       d_instantiate(dentry, inode);
> +       d_instantiate_new(dentry, inode);
>         return 0;
>
>  out_fail:
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 94acbde17bb1..66c6e17e61e5 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -224,6 +224,7 @@ extern seqlock_t rename_lock;
>   * These are the low-level FS interfaces to the dcache..
>   */
>  extern void d_instantiate(struct dentry *, struct inode *);
> +extern void d_instantiate_new(struct dentry *, struct inode *);
>  extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
>  extern struct dentry * d_instantiate_anon(struct dentry *, struct inode *);
>  extern int d_instantiate_no_diralias(struct dentry *, struct inode *);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-10 18:20 [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely Al Viro
  2018-05-10 19:11 ` Andreas Dilger
  2018-05-10 20:44 ` Mike Marshall
@ 2018-05-10 22:56 ` Dave Chinner
  2018-05-11  0:39   ` Al Viro
  2 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-05-10 22:56 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Thu, May 10, 2018 at 07:20:58PM +0100, Al Viro wrote:
> [in the spirit of "don't put 'em in without posting for review; the
> this is present in vfs.git#for-linus, if you prefer to look in git.
> 
> Background: a bunch of nfsd races fixes from back in 2008 had
> problems with lockdep enabled; in 2012 that got "fixed", unfortunately
> reopening a narrow race window.  The patch below does *NOT* fix
> all filesystems, but it does fix most of the exported local ones
> and it is easy to backport, so it makes for a sane starting point.

Do you have a pointer to the commits/test case for this? XFS has a
fairly significant separation between unlock_new_inode() and dentry
cache instantiation for some paths....

> If anyone has objections, this is your chance to yell.
> ]
> 
> For anything NFS-exported we do _not_ want to unlock new inode
> before it has grown an alias; original set of fixes got the
> ordering right, but missed the nasty complication in case of
> lockdep being enabled - unlock_new_inode() does
>     lockdep_annotate_inode_mutex_key(inode)
> which can only be done before anyone gets a chance to touch
> ->i_mutex.  Unfortunately, flipping the order and doing
> unlock_new_inode() before d_instantiate() opens a window when
> mkdir can race with open-by-fhandle on a guessed fhandle, leading
> to multiple aliases for a directory inode and all the breakage
> that follows from that.
> 
>     Correct solution: a new primitive (d_instantiate_new())
> combining these two in the right order - lockdep annotate, then
> d_instantiate(), then the rest of unlock_new_inode().  All
> combinations of d_instantiate() with unlock_new_inode() should
> be converted to that.

Ok, so this seems to touch only the paths that create new inodes
(mkdir, mknod, etc). Is the lookup path that does:


	unlock_new_inode()
	.....
	d_splice_alias(inode, dentry);

OK?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-10 22:56 ` Dave Chinner
@ 2018-05-11  0:39   ` Al Viro
  2018-05-11  1:32     ` Dave Chinner
  2018-05-11  6:15     ` Ritesh Harjani
  0 siblings, 2 replies; 11+ messages in thread
From: Al Viro @ 2018-05-11  0:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Linus Torvalds

On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:

> > For anything NFS-exported we do _not_ want to unlock new inode
> > before it has grown an alias; original set of fixes got the
> > ordering right, but missed the nasty complication in case of
> > lockdep being enabled - unlock_new_inode() does
> >     lockdep_annotate_inode_mutex_key(inode)
> > which can only be done before anyone gets a chance to touch
> > ->i_mutex.  Unfortunately, flipping the order and doing
> > unlock_new_inode() before d_instantiate() opens a window when
> > mkdir can race with open-by-fhandle on a guessed fhandle, leading
> > to multiple aliases for a directory inode and all the breakage
> > that follows from that.
> > 
> >     Correct solution: a new primitive (d_instantiate_new())
> > combining these two in the right order - lockdep annotate, then
> > d_instantiate(), then the rest of unlock_new_inode().  All
> > combinations of d_instantiate() with unlock_new_inode() should
> > be converted to that.
> 
> Ok, so this seems to touch only the paths that create new inodes
> (mkdir, mknod, etc). Is the lookup path that does:
> 
> 
> 	unlock_new_inode()
> 	.....
> 	d_splice_alias(inode, dentry);
> 
> OK?

Yes.  d_splice_alias()
	* will do the right thing when it runs into directory inode
that already has an alias
	* is called from ->d_lookup(), which has calling conventions
allowing to return a preexisting alias

The race in question is between mkdir() and open-by-fhandle that manages
to guess an fhandle for directory about to be created.  mkdir() side
creates a new inode, inserts it into icache (locked) and proceeds towards
unlock_new_inode()/d_instantiate().  Suppose it loses CPU right after
unlock_new_inode() and open-by-fhandle picks the inode from icache
(either having just gotten there, or finally gets woken up after having
waited for the sucker to get unlocked).  inode is valid, everything's
set up properply, so we pass it to d_obtain_alias(), which sees that
there's no exiting dentries, allocates one, rechecks, finds that there's
still nothing and proceeds to attach its new anon dentry to that inode.
Now mkdir regains CPU and does d_instantiate().  And we are fucked -
there are *two* dentries for given directory inode.

The window is narrow - to have a chance to hit it you need either
to run it in a VM or have security_d_instantiate() (from d_instantiate())
to do something slow (ideally - blocking).  It's non-empty, though.

Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
window - open-by-fhandle won't get to the inode until after mkdir has
attached a dentry to it.  Then d_obtain_alias() will simply return that
dentry and we are done.  It's only d_instantiate() (or d_add()) that is
a problem - d_splice_alias() is fine, so on the lookup path we don't
need anything like that.  d_add_ci() is like d_splice_alias() in that
respect, so the lookup is OK in case-insensitive variant as well.

So it would appear that XFS doesn't need to be touched.  HOWEVER,
lockdep shite *can't* be done after something has had a chance to grab
the damn rwsem.  I really wonder if
		d_instantiate(dentry, inode);
        xfs_finish_inode_setup(cip);
doesn't lead to unpleasantness with lockdep enabled:
	xfs_finish_inode_setup() -> unlock_new_inode() ->
lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
which does wonders if something has already gotten to the inode
via that dentry and tried e.g. lock_inode() on it.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-11  0:39   ` Al Viro
@ 2018-05-11  1:32     ` Dave Chinner
  2018-05-11  2:18       ` Al Viro
  2018-05-11  6:15     ` Ritesh Harjani
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-05-11  1:32 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Fri, May 11, 2018 at 01:39:01AM +0100, Al Viro wrote:
> On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:
> 
> > > For anything NFS-exported we do _not_ want to unlock new inode
> > > before it has grown an alias; original set of fixes got the
> > > ordering right, but missed the nasty complication in case of
> > > lockdep being enabled - unlock_new_inode() does
> > >     lockdep_annotate_inode_mutex_key(inode)
> > > which can only be done before anyone gets a chance to touch
> > > ->i_mutex.  Unfortunately, flipping the order and doing
> > > unlock_new_inode() before d_instantiate() opens a window when
> > > mkdir can race with open-by-fhandle on a guessed fhandle, leading
> > > to multiple aliases for a directory inode and all the breakage
> > > that follows from that.
> > > 
> > >     Correct solution: a new primitive (d_instantiate_new())
> > > combining these two in the right order - lockdep annotate, then
> > > d_instantiate(), then the rest of unlock_new_inode().  All
> > > combinations of d_instantiate() with unlock_new_inode() should
> > > be converted to that.
> > 
> > Ok, so this seems to touch only the paths that create new inodes
> > (mkdir, mknod, etc). Is the lookup path that does:
> > 
> > 
> > 	unlock_new_inode()
> > 	.....
> > 	d_splice_alias(inode, dentry);
> > 
> > OK?
> 
> Yes.  d_splice_alias()
> 	* will do the right thing when it runs into directory inode
> that already has an alias
> 	* is called from ->d_lookup(), which has calling conventions
> allowing to return a preexisting alias
> 
> The race in question is between mkdir() and open-by-fhandle that manages
> to guess an fhandle for directory about to be created.  mkdir() side
> creates a new inode, inserts it into icache (locked) and proceeds towards
> unlock_new_inode()/d_instantiate().  Suppose it loses CPU right after
> unlock_new_inode() and open-by-fhandle picks the inode from icache
> (either having just gotten there, or finally gets woken up after having
> waited for the sucker to get unlocked).  inode is valid, everything's
> set up properply, so we pass it to d_obtain_alias(), which sees that
> there's no exiting dentries, allocates one, rechecks, finds that there's
> still nothing and proceeds to attach its new anon dentry to that inode.
> Now mkdir regains CPU and does d_instantiate().  And we are fucked -
> there are *two* dentries for given directory inode.

Ok, thanks for the description of the race, Al. I understand it now.
:)

> 
> The window is narrow - to have a chance to hit it you need either
> to run it in a VM or have security_d_instantiate() (from d_instantiate())
> to do something slow (ideally - blocking).  It's non-empty, though.
> 
> Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
> window - open-by-fhandle won't get to the inode until after mkdir has
> attached a dentry to it.  Then d_obtain_alias() will simply return that
> dentry and we are done.  It's only d_instantiate() (or d_add()) that is
> a problem - d_splice_alias() is fine, so on the lookup path we don't
> need anything like that.  d_add_ci() is like d_splice_alias() in that
> respect, so the lookup is OK in case-insensitive variant as well.
> 
> So it would appear that XFS doesn't need to be touched.  HOWEVER,
> lockdep shite *can't* be done after something has had a chance to grab
> the damn rwsem.  I really wonder if
> 		d_instantiate(dentry, inode);
>         xfs_finish_inode_setup(cip);
> doesn't lead to unpleasantness with lockdep enabled:
> 	xfs_finish_inode_setup() -> unlock_new_inode() ->
> lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
> which does wonders if something has already gotten to the inode
> via that dentry and tried e.g. lock_inode() on it.

Could well do. Though it seems fixable.

i.e. we already have code in xfs_setup_inode() that sets the xfs
inode ILOCK rwsem dir/non-dir lockdep class before the new inode is
unlocked - we could just do the i_rwsem lockdep setup there, too.
Then, if we were to factor unlock_new_inode() as Andreas suggested,
we could call __unlock_new_inode() from xfs_finish_inode_setup().

I might be missing something subtle, but that looks to me like it
would work.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-11  1:32     ` Dave Chinner
@ 2018-05-11  2:18       ` Al Viro
  2018-05-11  3:00         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2018-05-11  2:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Linus Torvalds

On Fri, May 11, 2018 at 11:32:08AM +1000, Dave Chinner wrote:

> i.e. we already have code in xfs_setup_inode() that sets the xfs
> inode ILOCK rwsem dir/non-dir lockdep class before the new inode is
> unlocked - we could just do the i_rwsem lockdep setup there, too.

... which would suffice -

        if (S_ISDIR(inode->i_mode)) {
                struct file_system_type *type = inode->i_sb->s_type;

                /* Set new key only if filesystem hasn't already changed it */
                if (lockdep_match_class(&inode->i_rwsem, &type->i_mutex_key)) {

in lockdep_annotate_inode_mutex_key() would make sure that ->i_rwsem will be
left alone by unlock_new_inode().

> Then, if we were to factor unlock_new_inode() as Andreas suggested,
> we could call __unlock_new_inode() from xfs_finish_inode_setup().

No need - if you set the class in xfs_setup_inode(), you are fine.

Said that, hash insertion is also potentially delicate - another ext2/nfsd
race from the same pile back in 2008 had been
	* ext2_new_inode() chooses inumber
	* open-by-fhandle guesses the inumber and hits ext2_iget(), which
inserts a locked in-core inode into icache and proceeds to block reading
it from disk.
	* ext2_new_inode() inserts *its* in-core inode into icache (with
the same inumber) and sets the things up, both in-core and on disk
	* open-by-fhandle is back and sees a good live on-disk inode.
It finishes setting the in-core one up and we'd got *TWO* in-core inodes
with the same inumber, both hashed, both with dentries, both used by
syscalls to do IO.  Good times all around - fs corruption is fun.

That was fixed by using insert_inode_locked() in ext2_new_inode(), and doing
that before the on-disk inode would start looking good.  If it came during
ext2_iget(), it would've found an in-core inode with that inumber (locked,
doomed to be rejected), waited for it to come unlocked, see it unhashed
(since ext2_iget() said it was no good) and inserted its in-core inode
into hash (after having rechecked that nobody had an in-core inode with
the same inumber in there, that is).

I'm not familiar enough with XFS icache replacment to tell if anything
of that sort is a problem there; might be a non-issue for any number
of reasons.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-11  2:18       ` Al Viro
@ 2018-05-11  3:00         ` Dave Chinner
  2018-05-11 19:56           ` Al Viro
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-05-11  3:00 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

On Fri, May 11, 2018 at 03:18:43AM +0100, Al Viro wrote:
> On Fri, May 11, 2018 at 11:32:08AM +1000, Dave Chinner wrote:
> 
> > i.e. we already have code in xfs_setup_inode() that sets the xfs
> > inode ILOCK rwsem dir/non-dir lockdep class before the new inode is
> > unlocked - we could just do the i_rwsem lockdep setup there, too.
> 
> ... which would suffice -
> 
>         if (S_ISDIR(inode->i_mode)) {
>                 struct file_system_type *type = inode->i_sb->s_type;
> 
>                 /* Set new key only if filesystem hasn't already changed it */
>                 if (lockdep_match_class(&inode->i_rwsem, &type->i_mutex_key)) {
> 
> in lockdep_annotate_inode_mutex_key() would make sure that ->i_rwsem will be
> left alone by unlock_new_inode().

Ok, If you are happy with XFs doing that, I'll put together a patch
and send it out.

> > Then, if we were to factor unlock_new_inode() as Andreas suggested,
> > we could call __unlock_new_inode() from xfs_finish_inode_setup().
> 
> No need - if you set the class in xfs_setup_inode(), you are fine.
> 
> Said that, hash insertion is also potentially delicate - another ext2/nfsd
> race from the same pile back in 2008 had been
> 	* ext2_new_inode() chooses inumber
> 	* open-by-fhandle guesses the inumber and hits ext2_iget(), which
> inserts a locked in-core inode into icache and proceeds to block reading
> it from disk.
> 	* ext2_new_inode() inserts *its* in-core inode into icache (with
> the same inumber) and sets the things up, both in-core and on disk
> 	* open-by-fhandle is back and sees a good live on-disk inode.
> It finishes setting the in-core one up and we'd got *TWO* in-core inodes
> with the same inumber, both hashed, both with dentries, both used by
> syscalls to do IO.  Good times all around - fs corruption is fun.
> 
> That was fixed by using insert_inode_locked() in ext2_new_inode(), and doing
> that before the on-disk inode would start looking good.  If it came during
> ext2_iget(), it would've found an in-core inode with that inumber (locked,
> doomed to be rejected), waited for it to come unlocked, see it unhashed
> (since ext2_iget() said it was no good) and inserted its in-core inode
> into hash (after having rechecked that nobody had an in-core inode with
> the same inumber in there, that is).
> 
> I'm not familiar enough with XFS icache replacment to tell if anything
> of that sort is a problem there; might be a non-issue for any number
> of reasons.

I'm pretty sure we handle those cases - amongst other things we
don't trust inode numbers in filehandles and so validation of inode
numbers in incoming filehandles is serialised against
allocating/freeing of inodes before it even gets to inode cache
lookups...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-11  0:39   ` Al Viro
  2018-05-11  1:32     ` Dave Chinner
@ 2018-05-11  6:15     ` Ritesh Harjani
  1 sibling, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2018-05-11  6:15 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, linux-fsdevel, Linus Torvalds



On 5/11/2018 6:09 AM, Al Viro wrote:
> On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:
> 
>>> For anything NFS-exported we do _not_ want to unlock new inode
>>> before it has grown an alias; original set of fixes got the
>>> ordering right, but missed the nasty complication in case of
>>> lockdep being enabled - unlock_new_inode() does
>>>      lockdep_annotate_inode_mutex_key(inode)
>>> which can only be done before anyone gets a chance to touch
>>> ->i_mutex.  Unfortunately, flipping the order and doing
>>> unlock_new_inode() before d_instantiate() opens a window when
>>> mkdir can race with open-by-fhandle on a guessed fhandle, leading
>>> to multiple aliases for a directory inode and all the breakage
>>> that follows from that.
>>>
>>>      Correct solution: a new primitive (d_instantiate_new())
>>> combining these two in the right order - lockdep annotate, then
>>> d_instantiate(), then the rest of unlock_new_inode().  All
>>> combinations of d_instantiate() with unlock_new_inode() should
>>> be converted to that.
>>
>> Ok, so this seems to touch only the paths that create new inodes
>> (mkdir, mknod, etc). Is the lookup path that does:
>>
>>
>> 	unlock_new_inode()
>> 	.....
>> 	d_splice_alias(inode, dentry);
>>
>> OK?
> 
> Yes.  d_splice_alias()
> 	* will do the right thing when it runs into directory inode
> that already has an alias
> 	* is called from ->d_lookup(), which has calling conventions
> allowing to return a preexisting alias
> 
> The race in question is between mkdir() and open-by-fhandle that manages
> to guess an fhandle for directory about to be created.  mkdir() side
> creates a new inode, inserts it into icache (locked) and proceeds towards
> unlock_new_inode()/d_instantiate().  Suppose it loses CPU right after
> unlock_new_inode() and open-by-fhandle picks the inode from icache
> (either having just gotten there, or finally gets woken up after having
> waited for the sucker to get unlocked).  inode is valid, everything's
> set up properply, so we pass it to d_obtain_alias(), which sees that
> there's no exiting dentries, allocates one, rechecks, finds that there's
> still nothing and proceeds to attach its new anon dentry to that inode.
> Now mkdir regains CPU and does d_instantiate().  And we are fucked -
> there are *two* dentries for given directory inode.
> 
> The window is narrow - to have a chance to hit it you need either
> to run it in a VM or have security_d_instantiate() (from d_instantiate())
> to do something slow (ideally - blocking).  It's non-empty, though.
> 
> Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
> window - open-by-fhandle won't get to the inode until after mkdir has
> attached a dentry to it.  Then d_obtain_alias() will simply return that
> dentry and we are done.  It's only d_instantiate() (or d_add()) that is
> a problem - d_splice_alias() is fine, so on the lookup path we don't
> need anything like that.  d_add_ci() is like d_splice_alias() in that
> respect, so the lookup is OK in case-insensitive variant as well.
> 

Nice explanation -Could we add this description to the commit msg and 
also document about this API in -
"Documentation/filesystems/vfs.txt" under heading "Directory Entry Cache 
API".
That would be helpful for others later as well.

Thanks
Ritesh

> So it would appear that XFS doesn't need to be touched.  HOWEVER,
> lockdep shite *can't* be done after something has had a chance to grab
> the damn rwsem.  I really wonder if
> 		d_instantiate(dentry, inode);
>          xfs_finish_inode_setup(cip);
> doesn't lead to unpleasantness with lockdep enabled:
> 	xfs_finish_inode_setup() -> unlock_new_inode() ->
> lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
> which does wonders if something has already gotten to the inode
> via that dentry and tried e.g. lock_inode() on it.
> 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
  2018-05-11  3:00         ` Dave Chinner
@ 2018-05-11 19:56           ` Al Viro
  0 siblings, 0 replies; 11+ messages in thread
From: Al Viro @ 2018-05-11 19:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Linus Torvalds

On Fri, May 11, 2018 at 01:00:29PM +1000, Dave Chinner wrote:
> On Fri, May 11, 2018 at 03:18:43AM +0100, Al Viro wrote:
> > On Fri, May 11, 2018 at 11:32:08AM +1000, Dave Chinner wrote:
> > 
> > > i.e. we already have code in xfs_setup_inode() that sets the xfs
> > > inode ILOCK rwsem dir/non-dir lockdep class before the new inode is
> > > unlocked - we could just do the i_rwsem lockdep setup there, too.
> > 
> > ... which would suffice -
> > 
> >         if (S_ISDIR(inode->i_mode)) {
> >                 struct file_system_type *type = inode->i_sb->s_type;
> > 
> >                 /* Set new key only if filesystem hasn't already changed it */
> >                 if (lockdep_match_class(&inode->i_rwsem, &type->i_mutex_key)) {
> > 
> > in lockdep_annotate_inode_mutex_key() would make sure that ->i_rwsem will be
> > left alone by unlock_new_inode().
> 
> Ok, If you are happy with XFs doing that, I'll put together a patch
> and send it out.

That should probably go through xfs tree - no impact outside of fs/xfs...

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-05-11 19:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-10 18:20 [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely Al Viro
2018-05-10 19:11 ` Andreas Dilger
2018-05-10 19:32   ` Al Viro
2018-05-10 20:44 ` Mike Marshall
2018-05-10 22:56 ` Dave Chinner
2018-05-11  0:39   ` Al Viro
2018-05-11  1:32     ` Dave Chinner
2018-05-11  2:18       ` Al Viro
2018-05-11  3:00         ` Dave Chinner
2018-05-11 19:56           ` Al Viro
2018-05-11  6:15     ` Ritesh Harjani

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).