From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,USER_AGENT_MUTT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14B9BC43381 for ; Mon, 25 Mar 2019 19:43:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E942420879 for ; Mon, 25 Mar 2019 19:43:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729840AbfCYTni (ORCPT ); Mon, 25 Mar 2019 15:43:38 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:54194 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729473AbfCYTni (ORCPT ); Mon, 25 Mar 2019 15:43:38 -0400 Received: from viro by ZenIV.linux.org.uk with local (Exim 4.92 #3 (Red Hat Linux)) id 1h8VVM-000245-7z; Mon, 25 Mar 2019 19:43:32 +0000 Date: Mon, 25 Mar 2019 19:43:32 +0000 From: Al Viro To: Linus Torvalds Cc: syzbot , Alexei Starovoitov , Daniel Borkmann , linux-fsdevel , Linux List Kernel Mailing , syzkaller-bugs Subject: Re: KASAN: use-after-free Read in path_lookupat Message-ID: <20190325194332.GO2217@ZenIV.linux.org.uk> References: <0000000000006946d2057bbd0eef@google.com> <20190325045744.GK2217@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Mon, Mar 25, 2019 at 11:36:01AM -0700, Linus Torvalds wrote: > Right. Not just move the existing destroy_inode() - because as you > say, people may not be able to to do that in RCU contect, but split it > up, and add a "final_free_inode()" callback or something for the RCU > phase. > > In fact, I suspect that *every* user of destroy_inode() already has to > have its own RCU callback to another final freeing function anyway. Nope - pipes do not, and for a good reason. > Because they really shouldn't free the inode itself early. Maybe we > can just make that be a generic thing? Maybe... OTOH, we already have more methods on the destruction end than I'm comfortable trying to document. Because we clearly *do* need a clear set of rules along the lines of "this kind of thing belongs in this method, this - in that one". As it is, on the way to inode destruction we have 1) [kinda] ->drop_inode() - decide whether this inode is not worth keeping around in icache once the refcount reaches zero. Predicate, shouldn't change inode state at all. Common instances: * default (encoded as NULL, available as generic_drop_inode()) - "keep it around if it's still hashed and link count is non-zero". * generic_delete_inode(): "don't retain that sucker" However, 3 instances are more or less weird - f2fs, gfs and ocfs2. gfs2 one is the least odd of those, but the other two... What the hell is forced writeback doing in ocfs2_drop_inode()? If they don't want to retain anything, fine, but then why do the inode->i_state |= I_WILL_FREE; spin_unlock(&inode->i_lock); write_inode_now(inode, 1); spin_lock(&inode->i_lock); WARN_ON(inode->i_state & I_NEW); inode->i_state &= ~I_WILL_FREE; dance in ->drop_inode()? It will be immediately followed by ->evict_inode(), and it feels like the damn thing would be a lot more natural there... And what, for pity sake, f2fs is doing with truncation in that predicate, of all places? The comment in there is /* * This is to avoid a deadlock condition like below. * writeback_single_inode(inode) * - f2fs_write_data_page * - f2fs_gc -> iput -> evict * - inode_wait_for_writeback(inode) */ which looks... uninspiring, to put it mildly. 2) ->evict_inode() - called when we kick the inode out. Freeing on-disk inodes, etc. belongs there. inode is still in icache hash chain at that point, so any icache lookups for it will block until that thing is done. If we have something non-trivial done by iget... test() callbacks, we must keep the data structures needed by those. 3) ->destroy_inode() - destructor. By that point all remaining references to inode are (stale) RCU ones. The stuff that might be reached via those has to have an RCU delay between the call of ->destroy_inode() and freeing. Very commonly that's done by call_rcu(), and more often than not it's the only thing in ->destroy_inode(). However, if we know that there'll be no RCU accessors, we can do freeing immediately - pipes do just that. And the above is piss-poor as documentation goes - it doesn't answer the "where should this go?" any better than "try to see what similar filesystems are doing", which is asking for cargo-culting ;-/