From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from smtp.codeaurora.org ([198.145.29.96]:41226 "EHLO
        smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752590AbeEKGQG (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Fri, 11 May 2018 02:16:06 -0400
Subject: Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations
 safely
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>, linux-fsdevel@vger.kernel.org,
        Linus Torvalds <torvalds@linux-foundation.org>
References: <20180510182058.GP30522@ZenIV.linux.org.uk>
 <20180510225607.GU23861@dastard> <20180511003901.GW30522@ZenIV.linux.org.uk>
From: Ritesh Harjani <riteshh@codeaurora.org>
Message-ID: <a65939a8-58aa-68ce-0866-089c0ca5f31a@codeaurora.org>
Date: Fri, 11 May 2018 11:45:59 +0530
MIME-Version: 1.0
In-Reply-To: <20180511003901.GW30522@ZenIV.linux.org.uk>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>


On 5/11/2018 6:09 AM, Al Viro wrote:
> On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:
> 
>>> For anything NFS-exported we do _not_ want to unlock new inode
>>> before it has grown an alias; original set of fixes got the
>>> ordering right, but missed the nasty complication in case of
>>> lockdep being enabled - unlock_new_inode() does
>>>      lockdep_annotate_inode_mutex_key(inode)
>>> which can only be done before anyone gets a chance to touch
>>> ->i_mutex.  Unfortunately, flipping the order and doing
>>> unlock_new_inode() before d_instantiate() opens a window when
>>> mkdir can race with open-by-fhandle on a guessed fhandle, leading
>>> to multiple aliases for a directory inode and all the breakage
>>> that follows from that.
>>>
>>>      Correct solution: a new primitive (d_instantiate_new())
>>> combining these two in the right order - lockdep annotate, then
>>> d_instantiate(), then the rest of unlock_new_inode().  All
>>> combinations of d_instantiate() with unlock_new_inode() should
>>> be converted to that.
>>
>> Ok, so this seems to touch only the paths that create new inodes
>> (mkdir, mknod, etc). Is the lookup path that does:
>>
>>
>> 	unlock_new_inode()
>> 	.....
>> 	d_splice_alias(inode, dentry);
>>
>> OK?
> 
> Yes.  d_splice_alias()
> 	* will do the right thing when it runs into directory inode
> that already has an alias
> 	* is called from ->d_lookup(), which has calling conventions
> allowing to return a preexisting alias
> 
> The race in question is between mkdir() and open-by-fhandle that manages
> to guess an fhandle for directory about to be created.  mkdir() side
> creates a new inode, inserts it into icache (locked) and proceeds towards
> unlock_new_inode()/d_instantiate().  Suppose it loses CPU right after
> unlock_new_inode() and open-by-fhandle picks the inode from icache
> (either having just gotten there, or finally gets woken up after having
> waited for the sucker to get unlocked).  inode is valid, everything's
> set up properply, so we pass it to d_obtain_alias(), which sees that
> there's no exiting dentries, allocates one, rechecks, finds that there's
> still nothing and proceeds to attach its new anon dentry to that inode.
> Now mkdir regains CPU and does d_instantiate().  And we are fucked -
> there are *two* dentries for given directory inode.
> 
> The window is narrow - to have a chance to hit it you need either
> to run it in a VM or have security_d_instantiate() (from d_instantiate())
> to do something slow (ideally - blocking).  It's non-empty, though.
> 
> Doing it in the opposite order (as XFS does on mkdir et.al.) plugs that
> window - open-by-fhandle won't get to the inode until after mkdir has
> attached a dentry to it.  Then d_obtain_alias() will simply return that
> dentry and we are done.  It's only d_instantiate() (or d_add()) that is
> a problem - d_splice_alias() is fine, so on the lookup path we don't
> need anything like that.  d_add_ci() is like d_splice_alias() in that
> respect, so the lookup is OK in case-insensitive variant as well.
> 

Nice explanation -Could we add this description to the commit msg and 
also document about this API in -
"Documentation/filesystems/vfs.txt" under heading "Directory Entry Cache 
API".
That would be helpful for others later as well.

Thanks
Ritesh

> So it would appear that XFS doesn't need to be touched.  HOWEVER,
> lockdep shite *can't* be done after something has had a chance to grab
> the damn rwsem.  I really wonder if
> 		d_instantiate(dentry, inode);
>          xfs_finish_inode_setup(cip);
> doesn't lead to unpleasantness with lockdep enabled:
> 	xfs_finish_inode_setup() -> unlock_new_inode() ->
> lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
> which does wonders if something has already gotten to the inode
> via that dentry and tried e.g. lock_inode() on it.
> 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project.