From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EABEACA9ECF for ; Fri, 1 Nov 2019 23:46:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id CAE6821734 for ; Fri, 1 Nov 2019 23:46:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726532AbfKAXq0 (ORCPT ); Fri, 1 Nov 2019 19:46:26 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:40806 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725989AbfKAXqZ (ORCPT ); Fri, 1 Nov 2019 19:46:25 -0400 Received: from viro by ZenIV.linux.org.uk with local (Exim 4.92.3 #3 (Red Hat Linux)) id 1iQgcY-0003GU-3x; Fri, 01 Nov 2019 23:46:22 +0000 Date: Fri, 1 Nov 2019 23:46:22 +0000 From: Al Viro To: Ritesh Harjani Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, wugyuan@cn.ibm.com, jlayton@kernel.org, hsiangkao@aol.com, Jan Kara , Linus Torvalds , "Paul E. McKenney" Subject: Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Message-ID: <20191101234622.GM26530@ZenIV.linux.org.uk> References: <20190927044243.18856-1-riteshh@linux.ibm.com> <20191015040730.6A84742047@d06av24.portsmouth.uk.ibm.com> <20191022133855.B1B4752050@d06av21.portsmouth.uk.ibm.com> <20191022143736.GX26530@ZenIV.linux.org.uk> <20191022201131.GZ26530@ZenIV.linux.org.uk> <20191023110551.D04AE4C044@d06av22.portsmouth.uk.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20191023110551.D04AE4C044@d06av22.portsmouth.uk.ibm.com> User-Agent: Mutt/1.12.1 (2019-06-15) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Wed, Oct 23, 2019 at 04:35:50PM +0530, Ritesh Harjani wrote: > > > What we have guaranteed is > > > * ->d_lock serializes ->d_flags/->d_inode changes > > > * ->d_seq is bumped before/after such changes > > > * positive dentry never changes ->d_inode as long as you hold > > > a reference (negative dentry *can* become positive right under you) > > > > > > So there are 3 classes of valid users: those holding ->d_lock, those > > > sampling and rechecking ->d_seq and those relying upon having observed > > > the sucker they've pinned to be positive. > > :) Thanks for simplifying like this. Agreed. FWIW, after fixing several ceph bugs, add to that the following: * all places that turn a negative dentry into positive one are holding its parent exclusive or dentry has not been observable for anybody else. It had been present in the parent's list of children (negative and unhashed) and it might have been present in in-lookup hashtable. However, nobody is going to grab a reference to it from there without having grabbed ->d_lock on it and observed the state after it became positive. Which means that holding a reference to dentry *and* holding its parent at least shared stabilizes both ->d_inode and type bits in ->d_flags. The situation with barriers is more subtle - *IF* we had sufficient barriers to have ->d_inode/type bits seen right after having gotten the reference, we are fine. The only change possible after that point is negative->positive transition and that gets taken care of by barriers provided by ->i_rwsem. If we'd obtained that reference by d_lookup() or __d_lookup(), we are fine - ->d_lock gives a barrier. The same goes for places that grab references during a tree traversal, provided that they hold ->d_lock around that (fs/autofs/expire.c stuff). The same goes for having it found in inode's aliases list (->i_lock). I really hope that the same applies to accesses to file_dentry(file); on anything except alpha that would be pretty much automatic and on alpha we get the things along the lines of f = fdt[n] mb d = f->f_path.dentry i = d->d_inode assert(i != NULL) vs. see that d->d_inode is non-NULL f->f_path.dentry = d mb fdt[n] = f IOW, the barriers that make it safe to fetch the fields of struct file (rcu_dereference_raw() in __fcheck_files() vs. smp_store_release() in __fd_install() in the above) should *hopefully* take care of all stores visible by the time of do_dentry_open(). Sure, alpha cache coherency is insane, but AFAICS it's not _that_ insane. Question to folks familiar with alpha memory model: A = 0, B = NULL, C = NULL CPU1: A = 1 CPU2: r1 = A if (r1) { B = &A mb C = &B } CPU3: r2 = C; mb if (r2) { // &B r3 = *r2 // &A r4 = *r3 // 1 assert(r4 == 1) } is the above safe on alpha? [snip] > We may also need similar guarantees with __d_clear_type_and_inode(). Not really - pinned dentry can't go negative. In any case, with the audit I've done so far, I don't believe that blanket solutions like that are good idea - most of the places doing checks are safe as it is. The surface that needs to be taken care of is fairly small, actually; most of that is in fs/namei.c and fs/dcache.c.