From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751609AbaDRAhh (ORCPT ); Thu, 17 Apr 2014 20:37:37 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:34270 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750893AbaDRAhc (ORCPT ); Thu, 17 Apr 2014 20:37:32 -0400 Date: Fri, 18 Apr 2014 01:37:26 +0100 From: Al Viro To: "Eric W. Biederman" Cc: Linus Torvalds , "Serge E. Hallyn" , Linux-Fsdevel , Kernel Mailing List , Andy Lutomirski , Rob Landley , Miklos Szeredi , Christoph Hellwig , Karel Zak , "J. Bruce Fields" , Fengguang Wu , tytso@mit.edu Subject: Re: [GIT PULL] Detaching mounts on unlink for 3.15 Message-ID: <20140418003725.GE18016@ZenIV.linux.org.uk> References: <20140413053956.GM18016@ZenIV.linux.org.uk> <87zjjp3e7w.fsf@x220.int.ebiederm.org> <87ppkl1xb7.fsf@x220.int.ebiederm.org> <20140413215242.GP18016@ZenIV.linux.org.uk> <87y4z8uzqw.fsf_-_@x220.int.ebiederm.org> <87ppkhc4pp.fsf@x220.int.ebiederm.org> <87ha5r3emw.fsf_-_@x220.int.ebiederm.org> <20140417202237.GA18016@ZenIV.linux.org.uk> <87tx9rwsz4.fsf@x220.int.ebiederm.org> <20140417221203.GC18016@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140417221203.GC18016@ZenIV.linux.org.uk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 17, 2014 at 11:12:03PM +0100, Al Viro wrote: > That's all. And yes, I believe that such series would make sense on its > own and once it survives beating (see above about docker - that bastard has > surprised me quite a bit re stressing namespace-related codepaths), I would > be quite willing to help with getting it merged. FWIW, the tricky part around auto-close of acct is that we really want to preserve the following property: In usual setups, umount(2) will not return until fs has been shut down. fput() being async is not a problem - it will be processed before we return to userland. I agree that we don't need the loop anymore (it's basically a stack depth reduction measure that was needed with sync fput() - without "add one more and deal with it when we return" we would be getting mntput_no_expire -> fput -> mntput -> fs shutdown back then). But offloading that fput() to workqueue makes it really possible to have actual fs shutdown happen after umount(2) returns, without any extra mounts of the same fs, etc. And since that shutdown *can* take a long time (lots of dirty pages around, slow device or slow network, etc.), we really might be talking about e.g. umount(8) being finished before fs shutdown finishes. It's an expected situation when we have the same thing still mounted elsewhere or lazy-umounted and busy, but this changes behaviour on setups where we had been guaranteed that umount -a *will* wait until all filesystems except root are shut down and root is remounted r/o. So this change really can cause data loss on reboot(8)/halt(8) on existing boxen... IOW, workqueue is not the right tool here. OTOH, it looks like we do have a problem with kernel/acct.c vs. umount; it just requires a race between auto-closing and acct_process_in_ns(). It's narrow, so it doesn't bite us all the time, but it's there... Damn, it had been a long time since I really looked at that code ;-/ Actually, there's another reason why workqueue is bogus - we call do_acct_process(), same as we do on acct(NULL) (which might or might not be a good idea), but at least with do that from the context of real process doing umount(2). Doing that from workqueue is going to produce a really bogus record...