From mboxrd@z Thu Jan  1 00:00:00 1970
From: ebiederm@xmission.com (Eric W. Biederman)
Subject: Re: [PATCH review 4/6] mnt: Track when a directory escapes a bind mount
Date: Thu, 13 Aug 2015 23:10:26 -0500
Message-ID: <877foymrwt.fsf@x220.int.ebiederm.org>
References: <871tncuaf6.fsf@x220.int.ebiederm.org>
	<87mw5xq7lt.fsf@x220.int.ebiederm.org>
	<87a8yqou41.fsf_-_@x220.int.ebiederm.org>
	<874moq9oyb.fsf_-_@x220.int.ebiederm.org>
	<871tfkawu9.fsf_-_@x220.int.ebiederm.org>
	<87egjk9i61.fsf_-_@x220.int.ebiederm.org>
	<20150810043637.GC14139@ZenIV.linux.org.uk>
Mime-Version: 1.0
Content-Type: text/plain
Cc: Linux Containers <containers@lists.linux-foundation.org>,
	linux-fsdevel@vger.kernel.org,
	Andy Lutomirski <luto@amacapital.net>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	Richard Weinberger <richard@nod.at>,
	Andrey Vagin <avagin@openvz.org>, Jann Horn <jann@thejh.net>,
	Willy Tarreau <w@1wt.eu>, Omar Sandoval <osandov@osandov.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"J. Bruce Fields" <bfields@fieldses.org>
To: Al Viro <viro@ZenIV.linux.org.uk>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from out01.mta.xmission.com ([166.70.13.231]:49662 "EHLO
	out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751260AbbHNERZ (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 14 Aug 2015 00:17:25 -0400
In-Reply-To: <20150810043637.GC14139@ZenIV.linux.org.uk> (Al Viro's message of
	"Mon, 10 Aug 2015 05:36:37 +0100")
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote:
>
>> - The escape count on struct mount must be incremented both before the
>>   rename and after.  If the count is not incremented before the rename
>>   it is possible to hit a scenario where the rename happens the code
>>   walks up the directory tree to somewhere outside of the bind mount
>>   before the count is touched.  Similary without a count after the
>>   rename it is possible for the code to look at the escape count
>>   validate a path is connected before the rename and assume cache the
>>   escape count, leading to not retesting the path is ok.
>
> Umm...  I wonder if you are overcomplicating the things here.  Sure,
> I understand wanting to reduce the checks on "..", but...  It costs you
> considerable complexity (especially when it comes to 64bit counts),
> it's really brittle (you need to be very careful about the places where
> you zero the cached values in fs/namei.c and missing one will lead to
> really unpleasant effects there) _and_ it's all for the benefit of 
> a very rare case.  With check you are optimizing away not being all that
> costly anyway.

I had to give this a long hard think.  Algorithms going to O(N^2) when
it is uncessarry really bother me.  I ran some numbers for really deep
directory trees, slow memory, etc and I could not come up with a
scenario where even in it's worst case d_ancestor would take anywhere
near as long as a one disk seek, and most of the d_ancestor would be
much quicker.

So it appears to me that in the worst case a pathname lookup consisting
of a ridiculous number of .. components starting with a cold cache, on a
mount where a directory has escaped is likely to be faster than a
similar lookup going down the tree with many disk seeks.

I don't think the 64bit counts and the zeroing the cache values are
quite as bad as you make out.  There are much trickier things already in
path name lookup code.  But I do agree that it is easy to get wrong
because nothing will show up in testing, and getting it wrong will have
really unpleasant effects.

I also can't see a scenario where a directory would escape a subtree
that is mounted somewhere without it being a misconfiguration.

So I agree it is not worth it to optimize the code so that there
are an absolute minimum number of d_ancestor calls during pathname
lookup.

Further replacing mnt_escape_count with a mnt_flag makes the code much
simpler.  Which I very much appreciate.

>> - The largest change is in d_unalias, where the two cases are split
>>   apart so they can be handled separately.  In the easy case of a
>>   rename within the same directory all that is needed is __d_move
>>   (escaping a mount is impossible in that case).  In the more involved
>>   case mutexes need to be acquired, and now the spin locks need to be
>>   dropped so that proper lock aquisition order around __d_move can be
>>   arranged.
>
> I _really_ hate that part.  Could you explain WTF is wrong with simply
> taking mount_lock in that case of __d_splice_alias() just outside of
> rename_lock?

Me too.  So upon realizing the that inode->i_lock is held longer than
necessary in d_splice_alias I reworked the locking in d_splice_alias.

Updated patches to follow in a little bit.

> PS: that thing should be in fs/dcache.c, at least in the part that
> deals with finding the common ancestor, etc.  And __d_move() (and
> dentry_lock_for_move()) games with d_ancestor() should be redundant
> now.

It does seem reasonable that the BUG_ONs in __d_move that call
d_ancestor can be removed, or simplified by passing the common ancestor
into __d_move.

I don't know the code in dentry_lock_for_move well enough to say
anything except that the d_ancestor call in dentry_lock_for_move looks
reasonable.

Doing anything inside of __d_move or dentry_lock_for_move appears
to be a detour to the cause of preventing escaping from bind mounts.
So while I have no problems with the with the kinds of changes I hear
you suggesting, but unless I encounter something that makes changing
__d_move or dentry_lock_for_more relevant to the work of preventing
escaping from bind mounts I don't plan on touching them while that
is my focus.

Eric