From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1030652AbbDWSIH (ORCPT <rfc822;w@1wt.eu>);
	Thu, 23 Apr 2015 14:08:07 -0400
Received: from zeniv.linux.org.uk ([195.92.253.2]:50196 "EHLO
	ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1030354AbbDWSIA (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 23 Apr 2015 14:08:00 -0400
Date: Thu, 23 Apr 2015 19:07:55 +0100
From: Al Viro <viro@ZenIV.linux.org.uk>
To: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing
 stack footprint
Message-ID: <20150423180754.GA889@ZenIV.linux.org.uk>
References: <20150420181222.GK889@ZenIV.linux.org.uk>
 <20150421144959.GR889@ZenIV.linux.org.uk>
 <20150421150408.GA29838@infradead.org>
 <553668C1.8030707@nod.at>
 <20150421154504.GT889@ZenIV.linux.org.uk>
 <20150421212007.GU889@ZenIV.linux.org.uk>
 <20150422180702.GA15209@ZenIV.linux.org.uk>
 <20150422201238.GW889@ZenIV.linux.org.uk>
 <20150422210553.GX889@ZenIV.linux.org.uk>
 <20150423174544.48ae3122@notabene.brown>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150423174544.48ae3122@notabene.brown>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:

> follow_link calls  link_path_walk -> walk_component -> lookup_fast which sets
> nd->seq.  Is that not enough?  I guess not when nd_jump_link is called.  Is
> that what I missed?

No.  Potential nd_jump_link() callers are just refusing to go there in lazy
mode, end of story.  That's not the problem; see below.

> One thing that is clear to me is that I don't really understand all the
> handling of 'seq' numbers, making me unable to comment usefully.
> I'll try to go through the current code and you current patch with that issue
> in mind and make sure I do understand it.  Then I'll try to comment.

OK, here's the basic theory behind the lazy pathwalk:

* during the entire exercise we never drop rcu_read_lock, therefore any
objects that have all references to them removed before RCU-delayed
freeing (inodes, dentries, superblocks and vfsmounts are among such)
that we might find in process won't be freed until after we are done.

* invariant we are keeping:
	at some point between the beginning of walk and now the pathname
traversed so far would lead to nd->path, with nd->seq and nd->inode equal to
->d_seq and ->d_inode of the dentry in question.

* path_init() arranges for that to be true in the beginning of the walk.

* symlinks aside, walk_component() preserves that.
	+ normal name components go through lookup_fast(), where we have
	  __d_lookup_rcu() find a child of current nd->path with matching
	  name and (atomically) picks ->d_seq of that child, which had the
	  name matching our component.  Atomicity is provided by ->d_lock
	  on child.  Then we proceed to pick ->d_inode (and verify that
	  ->d_seq has not changed, thus making sure that ->d_inode value
	  at the moment when the name matched had been the same and child
	  is still hashed.  Then we verify that parent's ->d_seq has not
	  changed, guaranteeing that parent hadn't been moved or unhashed
	  from the moment we'd found it until after we'd found its child.
	  Assuming nothing's mounted on top of that thing, and there's no
	  problem with ->d_revalidate()), that's it - we have new nd->seq,
	  nd->path and nd->inode satisfying our invariant for longer
	  piece of pathname.
	+ "." needs nothing - we just stay where we are
	+ ".." is handled by follow_dotdot_rcu(), which in the normal case
	  (no mountpoint crossing) picks ->d_parent of where we are,
	  fetches its ->d_inode and ->d_seq and verifies that our directory
	  still hadn't changed _its_ ->d_seq.  The last part guarantees that
	  it hadn't been moved since the time we'd found it, and thus its
	  ->d_parent had remained unchanged at least until that verification.
	  Therefore, it remained pinned all along, and it ->d_inode had
	  remained stable, including the moment when we fetched ->d_seq.
	  Which means that the value we had picked *and* its ->d_inode and
	  ->d_seq would satisfy the invariant for the longer piece of
	  pathname.
	+ mountpoint crossing towards leaves is handled in __follow_mount_rcu();
	  it is simple (->mnt_root never changes and is always pinned,
	  stabilizing its ->d_inode), but we might need to worry about
	  automount points *and* need to make sure that we stop right
	  there if mount_lock has been bumped.  See commit b37199e6 for
	  the details on the last part - basically, making sure that false
	  negatives from __lookup_mnt() won't end up with hard error when
	  we walk into whatever had been under the mountpoint we'd missed.
	+ mountpoint crossing towards root (in follow_dotdot_rcu()) is
	  similar, but there we obviously don't care about automounts.
	  Looking at it now, it might make sense to recheck mount_lock there
	  as well, though - potential danger is to hit the moment when
	  mnt_parent and mnt_mountpoint are out of sync, leaving us with
	  mismatched vfsmount,dentry pair.  Generally, that will be caught
	  when we try to leave lazy mode (legitimize_mnt() will fail) or
	  to cross towards leaves, but the next crossing towards root
	  might be bogus as well, and we could end up with unwarranted hard
	  error.  Should be very hard to hit, but it's easy enough to check
	  *and* backport, so it looks like a good idea for -stable.  Linus,
	  do you have any objections against adding
                if (read_seqretry(&mount_lock, nd->m_seq))
                        goto failed;
	  right after
                if (!follow_up_rcu(&nd->path))
                        break;
	  in follow_dotdot_rcu()?  It's a very narrow race with mount --move
	  and most of the time it ends up being completely harmless, but
	  it's possible to construct a case when we'll get a bogus hard error
	  instead of falling back to non-lazy walk...  OTOH, anyone doing
	  that will get something inherently timing-dependent as the result,
	  lazy mode or not.  I'm in favour of adding that check, but IMO it's
	  not a critical problem.

* if we find that we can't continue in lazy mode because some verification
fails, we just fail with -ECHILD.  However, in cases when our current position
might be fine, but the next step can't be done in lazy mode, we attempt to
fall back to non-lazy without full restart that would be caused by -ECHILD.
That's what unlazy_walk() is.  Of course, if we reach the end of walk we
need to leave the lazy mode as well (complete_walk()).  Either can fail,
and such a failure means restart from scratch in non-lazy mode.  We need
to grab references on vfsmount and dentry (or two dentries, when we have
parent and child to deal with).  The interesting part is vfsmount - we need
to make sure we won't interfere with umount(2), having walked in that sucker
*after* umount(2) has checked that it's not busy.  See legitimize_mnt() for
details; basically, we have umount(2) mark the "I've verified it's not busy
and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
on RCU delays on that path - if we find one of those, we undo the
increment of its refcount we'd just done, without having dropped
rcu_read_lock().  Full-blown mntput() is done only on mismatches that
had _not_ been marked that way.

BTW, something similar to the above probably needs to be turned into coherent
text, either in parallel to Documentation/filesystems/path-lookup.txt, or
as update to it.

The reason why walk_component() in your call chain won't suffice is simple -
it will fail this
                /*
                 * This sequence count validates that the parent had no
                 * changes while we did the lookup of the dentry above.
                 *
                 * The memory barrier in read_seqcount_begin of child is
                 *  enough, we can use __read_seqcount_retry here.
                 */
                if (__read_seqcount_retry(&parent->d_seq, nd->seq))
                        return -ECHILD;
in lookup_fast(), just before updating nd->seq to new value.  Basically,
it has no way to tell if parent has been buggered to hell and back.

It's not _that_ hard to prevent - we just need to stop discarding the parent's
seq number in "need to follow a symlink" case of walk_component().  Will take
some massage, but not too much...