Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint

From: NeilBrown <neilb@suse.de>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint
Date: Fri, 24 Apr 2015 16:35:34 +1000	[thread overview]
Message-ID: <20150424163534.6eb109eb@notabene.brown> (raw)
In-Reply-To: <20150423180754.GA889@ZenIV.linux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 9971 bytes --]

On Thu, 23 Apr 2015 19:07:55 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:
> 
> > follow_link calls  link_path_walk -> walk_component -> lookup_fast which sets
> > nd->seq.  Is that not enough?  I guess not when nd_jump_link is called.  Is
> > that what I missed?
> 
> No.  Potential nd_jump_link() callers are just refusing to go there in lazy
> mode, end of story.  That's not the problem; see below.
> 
> > One thing that is clear to me is that I don't really understand all the
> > handling of 'seq' numbers, making me unable to comment usefully.
> > I'll try to go through the current code and you current patch with that issue
> > in mind and make sure I do understand it.  Then I'll try to comment.
> 
> OK, here's the basic theory behind the lazy pathwalk:
> 
> * during the entire exercise we never drop rcu_read_lock, therefore any
> objects that have all references to them removed before RCU-delayed
> freeing (inodes, dentries, superblocks and vfsmounts are among such)
> that we might find in process won't be freed until after we are done.
> 
> * invariant we are keeping:
> 	at some point between the beginning of walk and now the pathname
> traversed so far would lead to nd->path, with nd->seq and nd->inode equal to
> ->d_seq and ->d_inode of the dentry in question.

Thanks for all of this!
I think one part that was confusing me is that there is a place where nd->seq
does related to nd->path.dentry.
Usually  the two are updated at the same time.
However updated nd->seq, but *doesn't* update nd->path.dentry.
Instead, the dentry is stored in a separate 'path' variable.
This discrepancy is eventually resolved in a call to path_to_nameidata().

Which is a bit confusing until you know what is happening :-)


> 
> * path_init() arranges for that to be true in the beginning of the walk.
> 
> * symlinks aside, walk_component() preserves that.
> 	+ normal name components go through lookup_fast(), where we have
> 	  __d_lookup_rcu() find a child of current nd->path with matching
> 	  name and (atomically) picks ->d_seq of that child, which had the
> 	  name matching our component.  Atomicity is provided by ->d_lock
> 	  on child.  Then we proceed to pick ->d_inode (and verify that

I don't see d_lock being taken  in __d_lookup_rcu.
I think this atomicity is provided by ->d_seq on the child.


> 	  ->d_seq has not changed, thus making sure that ->d_inode value
> 	  at the moment when the name matched had been the same and child
> 	  is still hashed.  Then we verify that parent's ->d_seq has not
> 	  changed, guaranteeing that parent hadn't been moved or unhashed
> 	  from the moment we'd found it until after we'd found its child.
> 	  Assuming nothing's mounted on top of that thing, and there's no
> 	  problem with ->d_revalidate()), that's it - we have new nd->seq,
> 	  nd->path and nd->inode satisfying our invariant for longer
> 	  piece of pathname.
> 	+ "." needs nothing - we just stay where we are
> 	+ ".." is handled by follow_dotdot_rcu(), which in the normal case
> 	  (no mountpoint crossing) picks ->d_parent of where we are,
> 	  fetches its ->d_inode and ->d_seq and verifies that our directory
> 	  still hadn't changed _its_ ->d_seq.  The last part guarantees that
> 	  it hadn't been moved since the time we'd found it, and thus its
> 	  ->d_parent had remained unchanged at least until that verification.
> 	  Therefore, it remained pinned all along, and it ->d_inode had
> 	  remained stable, including the moment when we fetched ->d_seq.
> 	  Which means that the value we had picked *and* its ->d_inode and
> 	  ->d_seq would satisfy the invariant for the longer piece of
> 	  pathname.
> 	+ mountpoint crossing towards leaves is handled in __follow_mount_rcu();
> 	  it is simple (->mnt_root never changes and is always pinned,
> 	  stabilizing its ->d_inode), but we might need to worry about
> 	  automount points *and* need to make sure that we stop right
> 	  there if mount_lock has been bumped.  See commit b37199e6 for
> 	  the details on the last part - basically, making sure that false
> 	  negatives from __lookup_mnt() won't end up with hard error when
> 	  we walk into whatever had been under the mountpoint we'd missed.
> 	+ mountpoint crossing towards root (in follow_dotdot_rcu()) is
> 	  similar, but there we obviously don't care about automounts.
> 	  Looking at it now, it might make sense to recheck mount_lock there
> 	  as well, though - potential danger is to hit the moment when
> 	  mnt_parent and mnt_mountpoint are out of sync, leaving us with
> 	  mismatched vfsmount,dentry pair.  Generally, that will be caught
> 	  when we try to leave lazy mode (legitimize_mnt() will fail) or
> 	  to cross towards leaves, but the next crossing towards root
> 	  might be bogus as well, and we could end up with unwarranted hard
> 	  error.  Should be very hard to hit, but it's easy enough to check
> 	  *and* backport, so it looks like a good idea for -stable.  Linus,
> 	  do you have any objections against adding
>                 if (read_seqretry(&mount_lock, nd->m_seq))
>                         goto failed;
> 	  right after
>                 if (!follow_up_rcu(&nd->path))
>                         break;
> 	  in follow_dotdot_rcu()?  It's a very narrow race with mount --move
> 	  and most of the time it ends up being completely harmless, but
> 	  it's possible to construct a case when we'll get a bogus hard error
> 	  instead of falling back to non-lazy walk...  OTOH, anyone doing
> 	  that will get something inherently timing-dependent as the result,
> 	  lazy mode or not.  I'm in favour of adding that check, but IMO it's
> 	  not a critical problem.
> 
> * if we find that we can't continue in lazy mode because some verification
> fails, we just fail with -ECHILD.  However, in cases when our current position
> might be fine, but the next step can't be done in lazy mode, we attempt to
> fall back to non-lazy without full restart that would be caused by -ECHILD.
> That's what unlazy_walk() is.  Of course, if we reach the end of walk we
> need to leave the lazy mode as well (complete_walk()).  Either can fail,
> and such a failure means restart from scratch in non-lazy mode.  We need
> to grab references on vfsmount and dentry (or two dentries, when we have
> parent and child to deal with).  The interesting part is vfsmount - we need
> to make sure we won't interfere with umount(2), having walked in that sucker
> *after* umount(2) has checked that it's not busy.  See legitimize_mnt() for
> details; basically, we have umount(2) mark the "I've verified it's not busy
> and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
> on RCU delays on that path - if we find one of those, we undo the
> increment of its refcount we'd just done, without having dropped
> rcu_read_lock().  Full-blown mntput() is done only on mismatches that
> had _not_ been marked that way.
> 
> BTW, something similar to the above probably needs to be turned into coherent
> text, either in parallel to Documentation/filesystems/path-lookup.txt, or
> as update to it.

That might be fun.... if I find time.


> 
> The reason why walk_component() in your call chain won't suffice is simple -
> it will fail this
>                 /*
>                  * This sequence count validates that the parent had no
>                  * changes while we did the lookup of the dentry above.
>                  *
>                  * The memory barrier in read_seqcount_begin of child is
>                  *  enough, we can use __read_seqcount_retry here.
>                  */
>                 if (__read_seqcount_retry(&parent->d_seq, nd->seq))
>                         return -ECHILD;
> in lookup_fast(), just before updating nd->seq to new value.  Basically,
> it has no way to tell if parent has been buggered to hell and back.
> 
> It's not _that_ hard to prevent - we just need to stop discarding the parent's
> seq number in "need to follow a symlink" case of walk_component().  Will take
> some massage, but not too much...

So.....
Where I have:

+                       if (nd->flags & LOOKUP_RCU) {
+                               if (!nd->root.mnt)
+                                       set_root_rcu(nd);
+                               nd->path = nd->root;


in the case where the symlink starts '/', I need to set nd->seq

		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);


but in the case where symlink doesn't start '/', I don't change nd->path,
so nd->seq should still be correct?

Also, just after that (still in follow_link()), we have

		nd->inode = nd->path.dentry->d_inode;

*after* the test for (*s == '/').
Wouldn't is be OK to have that *inside* the if?  In the relative-symlink case
it should be unchanged, and in the jump-symlink case nd_jump_link() has already
done that (and probably NULL was returns for 's' so this code doesn't execute).

So following on top of my previous patches?

Thanks heaps,
NeilBrown

diff --git a/fs/namei.c b/fs/namei.c
index d13b4315447f..ce6387d5317c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -947,6 +947,8 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 				if (!nd->root.mnt)
 					set_root_rcu(nd);
 				nd->path = nd->root;
+				nd->seq = read_seqcount_begin(
+					&nd->path->dentry->d_seq);
 			} else {
 				if (!nd->root.mnt)
 					set_root(nd);
@@ -954,9 +956,9 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 				nd->path = nd->root;
 				path_get(&nd->root);
 			}
+			nd->inode = nd->path.dentry->d_inode;
 			nd->flags |= LOOKUP_JUMPED;
 		}
-		nd->inode = nd->path.dentry->d_inode;
 		error = link_path_walk(s, nd);
 		if (unlikely(error))
 			put_link(nd, link, inode, *p);

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]