From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4945FC33C99 for ; Fri, 10 Jan 2020 06:21:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E3D6A2073A for ; Fri, 10 Jan 2020 06:21:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=themaw.net header.i=@themaw.net header.b="s3Rok71B"; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b="dWPdyE5i" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731487AbgAJGVH (ORCPT ); Fri, 10 Jan 2020 01:21:07 -0500 Received: from new2-smtp.messagingengine.com ([66.111.4.224]:37923 "EHLO new2-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731474AbgAJGVH (ORCPT ); Fri, 10 Jan 2020 01:21:07 -0500 Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailnew.nyi.internal (Postfix) with ESMTP id 5662B750A; Fri, 10 Jan 2020 01:21:05 -0500 (EST) Received: from mailfrontend2 ([10.202.2.163]) by compute1.internal (MEProxy); Fri, 10 Jan 2020 01:21:05 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=themaw.net; h= message-id:subject:from:to:cc:date:in-reply-to:references :content-type:mime-version:content-transfer-encoding; s=fm2; bh= b+ZGcb0/KVYZkQOlq20JAB6Sn30afLRg5IIrioZNDZ8=; b=s3Rok71ByVn9eVvq u4+98kagxaqDcIhtphgolhxfCXBCNDCSURX+1CWoi1Uu4hy13+Ex3cQfsd3bSRic NKWdKEHEhyBcMpDd1ErnjBKl3G2Rdf4pU1dM3tdCYAi9Tq6BD+XYc1P62NVM1pfk CLUnDZFlMS9h4TJ2+0k1E854cukRqR9ZHOQgfKyrWKmGEscNmwA682mAkYVEtrmR oJE2s6VggThR3nqP2XQcHHokGXh5Hh6nvFeTOnoH9tGQfgPOxRWaPRwM2jaus67/ AfM6vi2yN1GtGTUzNqLtb12rlo+TMYl9c7UhvG5RoJjQsjRXQmR2+aP1hWMdxRtP 3GnK7A== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:from:in-reply-to:message-id:mime-version:references :subject:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm1; bh=b+ZGcb0/KVYZkQOlq20JAB6Sn30afLRg5IIrioZND Z8=; b=dWPdyE5i6jrc845fcUl2SC1og32KDL+4p1Ur8234QsYGCS8lgkbwiLqna 0VlKgtPR+MJepSj04lW+15OgjbW/ppTT+aSSZOgmCqYvmPvSETpyWh4V1JJq8Jse HaTxmFqownOAQ2ZfyW6tclCf80UY5+Z2ev+zwd2bA9miNpEM/jV1IM/Nct7acs/a Hvl5TCHeKClTTnpBnPWZ36ibYgk5RLETF774tcEb3jf1+xFmfWofRjHuc5g1D5k3 e4TMEQ41IU+X5s/Ie78jjtUWA65y5yZnhb5xUllPmoQuhPQnKSQ0eC6ME5wB1vPC wMAKe3SMFJ5LnykQDzPcjtwmnPpgg== X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedufedrvdeivddgkeelucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne cujfgurhepkffuhffvffgjfhgtfggggfesthejredttderjeenucfhrhhomhepkfgrnhcu mfgvnhhtuceorhgrvhgvnhesthhhvghmrgifrdhnvghtqeenucffohhmrghinhepthhuhh hsrdhorhhgnecukfhppeduudekrddvtdelrddujeehrddvheenucfrrghrrghmpehmrghi lhhfrhhomheprhgrvhgvnhesthhhvghmrgifrdhnvghtnecuvehluhhsthgvrhfuihiivg eptd X-ME-Proxy: Received: from mickey.themaw.net (unknown [118.209.175.25]) by mail.messagingengine.com (Postfix) with ESMTPA id AB1A530600A8; Fri, 10 Jan 2020 01:20:59 -0500 (EST) Message-ID: <979cf680b0fbdce515293a3449d564690cde6a3f.camel@themaw.net> Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks From: Ian Kent To: Al Viro , Linus Torvalds Cc: Aleksa Sarai , David Howells , Eric Biederman , stable , Christian Brauner , Serge Hallyn , dev@opencontainers.org, Linux Containers , Linux API , linux-fsdevel , Linux Kernel Mailing List Date: Fri, 10 Jan 2020 14:20:55 +0800 In-Reply-To: <20200110041523.GK8904@ZenIV.linux.org.uk> References: <20200101005446.GH4203@ZenIV.linux.org.uk> <20200101030815.GA17593@ZenIV.linux.org.uk> <20200101144407.ugjwzk7zxrucaa6a@yavin.dot.cyphar.com> <20200101234009.GB8904@ZenIV.linux.org.uk> <20200102035920.dsycgxnb6ba2jhz2@yavin.dot.cyphar.com> <20200103014901.GC8904@ZenIV.linux.org.uk> <20200108031314.GE8904@ZenIV.linux.org.uk> <20200108213444.GF8904@ZenIV.linux.org.uk> <20200110041523.GK8904@ZenIV.linux.org.uk> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.32.5 (3.32.5-1.fc30) MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Fri, 2020-01-10 at 04:15 +0000, Al Viro wrote: > On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote: > > On Wed, Jan 8, 2020 at 1:34 PM Al Viro > > wrote: > > > The point is, we'd never followed mounts on /proc/self/cwd et.al. > > > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from > > > me) > > > is that way. > > > > Hmm. If that's the case, maybe they should be marked implicitly as > > O_PATH when opened? > > I thought you wanted O_PATH as starting point to have mounts > traversed? > Confused... > > > > Actually, scratch that - 2.0 behaves the same way > > > (mountpoint crossing is done in iget() there; is that Minix > > > influence > > > or straight from the Lions' book?) > > > > I don't think I ever had access to Lions' - I've _seen_ a printout > > of > > it later, and obviously maybe others did, > > > > More likely it's from Maurice Bach: the Design of the Unix > > Operating > > System. I'm pretty sure that's where a lot of the FS layer stuff > > came > > from. Certainly the bad old buffer head interfaces, and quite > > likely > > the iget() stuff too. > > > > > 0.10: forward traversal in iget(), back traversal in > > > fs/namei.c:find_entry() > > > > Whee, you _really_ went back in time. > > > > So I did too. > > > > And looking at that code in iget(), I doubt it came from anywhere. > > Christ. It's just looping over a fixed-size array, both when > > finding > > the inode, and finding the superblock. > > > > Cute, but unbelievably stupid. It was a more innocent time. > > > > In other words, I think you can chalk it up to just me, because > > blaming anybody else for that garbage would be very very unfair > > indeed > > ;) > > See > https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c > Exactly the same algorithm, complete with linear searches over those > fixed-sized array. > > Right, he simply transcribes v7 iget(). > > So I suspect that you are right - your variant of iget was pretty > much > one-to-one implementation of Bach's description of v7 iget. > > Your namei wasn't - Bach has 'if the entry points to root and you are > in the root and name is "..", find mount table entry (by device > number), > drop your directory inode, grab the inode of mountpount and restart > the search for ".." in there', which gives back traversals to > arbitrary > depth. And v7 namei() (as Bach mentions) uses iget() for starting > point > as well as for each component. You kept pointers instead, which is > where > the other difference has come from (no mount traversal at the > starting > point)... > > Actually, I've misread your code in 0.10 - it does unlimited forward > traversals; it's back traversals that go only one level. The forward > ones got limited to one level in 0.95, but then mount-over-root had > been banned all along. I'd read the pre-dcache variant of iget(), > seen it go pretty much all the way back to beginning and hadn't > sorted out the 0.12 -> 0.95 transition... > > > > How would your proposal deal with access("/proc/self/fd/42/foo", > > > MAY_READ) > > > vs. faccessat(42, "foo", MAY_READ)? > > > > I think that in a perfect world, the O_PATH'ness of '42' would be > > the > > deciding factor. Wouldn't those be the best and most consistent > > semantics? > > > > And then 'cwd'/'root' always have the O_PATH behavior. > > See above - unless I'm misparsing you, you wanted mount traversals in > the > starting point if it's ...at() with O_PATH fd. With O_PATH open() > not > doing them. > > For cwd and root the situation is opposite - we do NOT traverse > mounts > for those. And that's really too late to change. > > > > The latter would trigger automount, > > > the former would not... Or would you extend that to "traverse > > > mounts > > > upon following procfs links, if the file in question had been > > > opened with > > > O_PATH"? > > > > Exactly. > > > > But you know what? I do not believe this is all that important, and > > I > > doubt it will matter to anybody. > > FWIW, digging through the automount-related parts of that stuff has > caught several fun issues. One (and I'm rather embarrassed by it) > should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount > overput on simultaneous automount). To quote the commit message: > The problem is that lock_mount() drops the caller's reference to > the > mountpoint's vfsmount in the case where it finds something > already mounted on > the mountpoint as it transits to the mounted filesystem and > replaces path->mnt > with the new mountpoint vfsmount. > > During a pathwalk, however, we don't take a reference on the > vfsmount if it is > the same as the one in the nameidata struct, but do_add_mount() > doesn't know > this. > At which point I should've gone "what the fuck?" - lock_mount() does, > indeed, > drop path->mnt in this situation and replaces it with the whatever's > come to > cover it. For mount(2) that's the right thing to do - we _want_ to > mount > on top of whatever we have at the mountpoint. For automounts we very > much > don't want that - it's either "mount right on top of the automount > trigger" > or discard whatever we'd been about to mount and walk into whatever's > got > mounted there (presumably the same thing triggered by another > process). > We kinda-sorta get that effect, but in a very convoluted way: > do_add_mount() > will refuse to mount something on top of itself - > /* Refuse the same filesystem on the same mount point */ > err = -EBUSY; > if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && > path->mnt->mnt_root == path->dentry) > goto unlock; > which will end up with -EBUSY returned (and recognized by > follow_automount()). > > First of all, that's unreliable. If somebody not only has triggered > that > automount, but managed to _mount_ something else on top (for example, > has triggered it by lookup of mountpoint-to-be in mount(2)), we'll > end > up not triggering that check. In which case we'll get something like > nfs referral point under nfs automounted there under tmpfs from > explicit > overmount under same nfs mount we'd automounted there - identical to > what's > been buried under tmpfs. It's hard to hit, but not impossibly so. > > What's more, the whole solution is a kludge - the root of problem is > that lock_mount() is the wrong thing to do in case of > finish_automount(). > We don't want to go into whatever's overmounting us there, both for > the reasons above *and* because it's a PITA for the caller. So the > right solution is > * lift lock_mount() call from do_add_mount() into its callers > (all 2 of them); while we are at it, lift unlock_mount() as well > (makes for simpler failure exits in do_add_mount()). > * replace the call of lock_mount() in finish_automount() > with variant that doesn't do "unlock, walk deeper and retry locking", > returning ERR_PTR(-EBUSY) in such case. > * get rid of the kludge introduced in that commit. Better > yet, don't bother with traversing into the covering mount in case > of success - let the caller of follow_automount() do that. Which > eliminates the need to pass need_mntput to the sucker and suggests > an even better solution - have this analogue of lock_mount() > return NULL instead of ERR_PTR(-EBUSY) and treat it in > finish_automount() > as "OK, discard what we wanted to mount and return 0". That gets > rid of the entire > err = finish_automount(mnt, path); > switch (err) { > case -EBUSY: > /* Someone else made a mount here whilst we were busy > */ > return 0; > case 0: > path_put(path); > path->mnt = mnt; > path->dentry = dget(mnt->mnt_root); > return 0; > default: > return err; > } > chunk in follow_automount() - it would just be > return finish_automount(mnt, path); > > Another thing (in the same area) is not a bug per se, but... > after the call of ->d_automount() we have this: > if (IS_ERR(mnt)) { > /* > * The filesystem is allowed to return -EISDIR here > to indicate > * it doesn't want to automount. For instance, > autofs would do > * this so that its userspace daemon can mount on > this dentry. > * > * However, we can only permit this if it's a > terminal point in > * the path being looked up; if it wasn't then the > remainder of > * the path is inaccessible and we should say so. > */ > if (PTR_ERR(mnt) == -EISDIR && (nd->flags & > LOOKUP_PARENT)) > return -EREMOTE; > return PTR_ERR(mnt); > } > Except that not a single instance of ->d_automount() has ever > returned > -EISDIR. Certainly not autofs one, despite the what the comment > says. > That chunk has come from dhowells, back when the whole mount trap > series > had been merged. After talking that thing over (fun: trying to > figure > out what had been intended nearly 9 years ago, when people involved > are > in UK, US east coast and AU west coast respectively. The only way it > could suck more would've been if I were on the west coast - then all > timezone deltas would be 8-hour ones)... looks like it's a rudiment > of plans that got superseded during the series development, nobody > quite remembers exact details. Conclusion: it's not even dead, it's > stillborn; bury it. Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time we get there it's not relevant any more, so that check looks redundant. I'm not aware of any other fs automount implementation that needs that EISDIR pass-thru function. I didn't notice it at the time of the merge, sorry about that. While we're at it that: if (!path->dentry->d_op || !path->dentry->d_op->d_automount) return -EREMOTE; at the top of follow_automount() isn't going to be be relevant for autofs because ->d_automount() really must always be defined for it. But, at the time of the merge, I didn't object to it because there were (are) other file systems that use the VFS automount function which may accidentally not define the method. > > Unfortunately, there are other interesting questions related to > autofs-specific bits (->d_manage()) and the timezone-related fun > is, of course, still there. I hope to sort that out today or > tomorrow, at least enough to do a reasonable set of backportable > fixes to put in front of follow_managed()/step_into() queue. > Oh, well... Yeah, I know it slows you down but I kink-off like having a chance to look at what's going and think about your questions before trying to answer them, rather than replying prematurely, as I usually do ... It's been a bit of a busy day so far but I'm getting to look into the questions you've asked. Ian