From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp.codeaurora.org ([198.145.29.96]:41226 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752590AbeEKGQG (ORCPT ); Fri, 11 May 2018 02:16:06 -0400
Subject: Re: [RFC][PATCH] do d_instantiate/unlock_new_inode combinations safely
To: Al Viro
Cc: Dave Chinner, linux-fsdevel@vger.kernel.org, Linus Torvalds
References: <20180510182058.GP30522@ZenIV.linux.org.uk> <20180510225607.GU23861@dastard> <20180511003901.GW30522@ZenIV.linux.org.uk>
From: Ritesh Harjani
Message-ID:
Date: Fri, 11 May 2018 11:45:59 +0530
MIME-Version: 1.0
In-Reply-To: <20180511003901.GW30522@ZenIV.linux.org.uk>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 5/11/2018 6:09 AM, Al Viro wrote:
> On Fri, May 11, 2018 at 08:56:07AM +1000, Dave Chinner wrote:
>
>>> For anything NFS-exported we do _not_ want to unlock new inode
>>> before it has grown an alias; original set of fixes got the
>>> ordering right, but missed the nasty complication in case of
>>> lockdep being enabled - unlock_new_inode() does
>>>     lockdep_annotate_inode_mutex_key(inode)
>>> which can only be done before anyone gets a chance to touch
>>> ->i_mutex. Unfortunately, flipping the order and doing
>>> unlock_new_inode() before d_instantiate() opens a window when
>>> mkdir can race with open-by-fhandle on a guessed fhandle, leading
>>> to multiple aliases for a directory inode and all the breakage
>>> that follows from that.
>>>
>>> Correct solution: a new primitive (d_instantiate_new())
>>> combining these two in the right order - lockdep annotate, then
>>> d_instantiate(), then the rest of unlock_new_inode(). All
>>> combinations of d_instantiate() with unlock_new_inode() should
>>> be converted to that.
>>
>> Ok, so this seems to touch only the paths that create new inodes
>> (mkdir, mknod, etc).
>> Is the lookup path that does:
>>
>>
>>     unlock_new_inode()
>>     .....
>>     d_splice_alias(inode, dentry);
>>
>> OK?
>
> Yes. d_splice_alias()
>     * will do the right thing when it runs into directory inode
>       that already has an alias
>     * is called from ->d_lookup(), which has calling conventions
>       allowing to return a preexisting alias
>
> The race in question is between mkdir() and open-by-fhandle that manages
> to guess an fhandle for directory about to be created. mkdir() side
> creates a new inode, inserts it into icache (locked) and proceeds towards
> unlock_new_inode()/d_instantiate(). Suppose it loses CPU right after
> unlock_new_inode() and open-by-fhandle picks the inode from icache
> (either having just gotten there, or finally gets woken up after having
> waited for the sucker to get unlocked). inode is valid, everything's
> set up properly, so we pass it to d_obtain_alias(), which sees that
> there's no existing dentries, allocates one, rechecks, finds that there's
> still nothing and proceeds to attach its new anon dentry to that inode.
> Now mkdir regains CPU and does d_instantiate(). And we are fucked -
> there are *two* dentries for given directory inode.
>
> The window is narrow - to have a chance to hit it you need either
> to run it in a VM or have security_d_instantiate() (from d_instantiate())
> to do something slow (ideally - blocking). It's non-empty, though.
>
> Doing it in the opposite order (as XFS does on mkdir et al.) plugs that
> window - open-by-fhandle won't get to the inode until after mkdir has
> attached a dentry to it. Then d_obtain_alias() will simply return that
> dentry and we are done. It's only d_instantiate() (or d_add()) that is
> a problem - d_splice_alias() is fine, so on the lookup path we don't
> need anything like that. d_add_ci() is like d_splice_alias() in that
> respect, so the lookup is OK in case-insensitive variant as well.
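
To make the two orderings in the explanation above concrete, here is a toy
user-space model of the race. It is only a sketch under loud assumptions:
struct toy_inode and every toy_* function are made up for illustration and
are not kernel APIs, the "race" is a hard-coded interleaving rather than real
concurrency, and the lockdep annotation step is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an inode sitting in the icache (NOT struct inode). */
struct toy_inode {
    bool is_new;     /* models I_NEW: creator has not unlocked it yet */
    int  nr_aliases; /* dentries attached to this directory inode */
};

/* Models d_instantiate(): attaches one more dentry, no questions asked. */
static void toy_d_instantiate(struct toy_inode *inode)
{
    inode->nr_aliases++;
}

/* Models unlock_new_inode(): clears I_NEW so other lookups may proceed. */
static void toy_unlock_new_inode(struct toy_inode *inode)
{
    inode->is_new = false;
}

/*
 * Models open-by-fhandle reaching d_obtain_alias().
 * Returns false if it would have to wait for the inode to be unlocked.
 */
static bool toy_open_by_fhandle(struct toy_inode *inode)
{
    if (inode->is_new)
        return false;          /* would sleep until I_NEW is cleared */
    if (inode->nr_aliases == 0)
        inode->nr_aliases++;   /* no alias yet: attach an anon dentry */
    return true;               /* otherwise reuse the existing alias */
}

/* Racy order: unlock first, opener sneaks in, then d_instantiate(). */
int toy_mkdir_unlock_first(void)
{
    struct toy_inode inode = { .is_new = true, .nr_aliases = 0 };

    toy_unlock_new_inode(&inode);
    toy_open_by_fhandle(&inode);   /* runs inside the window */
    toy_d_instantiate(&inode);
    return inode.nr_aliases;
}

/* Safe order: attach the dentry first, only then unlock. */
int toy_mkdir_instantiate_first(void)
{
    struct toy_inode inode = { .is_new = true, .nr_aliases = 0 };

    bool got_in = toy_open_by_fhandle(&inode);
    assert(!got_in);               /* opener is held off while I_NEW is set */
    toy_d_instantiate(&inode);
    toy_unlock_new_inode(&inode);
    toy_open_by_fhandle(&inode);   /* finds the alias mkdir attached */
    return inode.nr_aliases;
}
```

In this model toy_mkdir_unlock_first() ends with two aliases on the directory
inode, while toy_mkdir_instantiate_first() ends with one, which is the
difference the explanation above is about.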

Nice explanation.
Could we add this description to the commit message, and also document
this API in "Documentation/filesystems/vfs.txt" under the heading
"Directory Entry Cache API"?
That would be helpful for others later as well.

Thanks
Ritesh

> So it would appear that XFS doesn't need to be touched. HOWEVER,
> lockdep shite *can't* be done after something has had a chance to grab
> the damn rwsem. I really wonder if
>     d_instantiate(dentry, inode);
>     xfs_finish_inode_setup(cip);
> doesn't lead to unpleasantness with lockdep enabled:
> xfs_finish_inode_setup() -> unlock_new_inode() ->
> lockdep_annotate_inode_mutex_key() -> init_rwsem(&inode->i_rwsem)
> which does wonders if something has already gotten to the inode
> via that dentry and tried e.g. lock_inode() on it.
>

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.