From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932287AbbDXGfr (ORCPT <rfc822;w@1wt.eu>);
	Fri, 24 Apr 2015 02:35:47 -0400
Received: from cantor2.suse.de ([195.135.220.15]:34818 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932076AbbDXGfq (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 24 Apr 2015 02:35:46 -0400
Date: Fri, 24 Apr 2015 16:35:34 +1000
From: NeilBrown <neilb@suse.de>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing
 stack footprint
Message-ID: <20150424163534.6eb109eb@notabene.brown>
In-Reply-To: <20150423180754.GA889@ZenIV.linux.org.uk>
References: <20150420181222.GK889@ZenIV.linux.org.uk>
	<20150421144959.GR889@ZenIV.linux.org.uk>
	<20150421150408.GA29838@infradead.org>
	<553668C1.8030707@nod.at>
	<20150421154504.GT889@ZenIV.linux.org.uk>
	<20150421212007.GU889@ZenIV.linux.org.uk>
	<20150422180702.GA15209@ZenIV.linux.org.uk>
	<20150422201238.GW889@ZenIV.linux.org.uk>
	<20150422210553.GX889@ZenIV.linux.org.uk>
	<20150423174544.48ae3122@notabene.brown>
	<20150423180754.GA889@ZenIV.linux.org.uk>
X-Mailer: Claws Mail 3.10.1-162-g4d0ed6 (GTK+ 2.24.25; x86_64-suse-linux-gnu)
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 boundary="Sig_/e_LDiU38DlgFNAZsf6dvkgZ"; protocol="application/pgp-signature"
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

--Sig_/e_LDiU38DlgFNAZsf6dvkgZ
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Thu, 23 Apr 2015 19:07:55 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:
>=20
> > follow_link calls  link_path_walk -> walk_component -> lookup_fast whic=
h sets
> > nd->seq.  Is that not enough?  I guess not when nd_jump_link is called.=
  Is
> > that what I missed?
>=20
> No.  Potential nd_jump_link() callers are just refusing to go there in la=
zy
> mode, end of story.  That's not the problem; see below.
>=20
> > One thing that is clear to me is that I don't really understand all the
> > handling of 'seq' numbers, making me unable to comment usefully.
> > I'll try to go through the current code and you current patch with that=
 issue
> > in mind and make sure I do understand it.  Then I'll try to comment.
>=20
> OK, here's the basic theory behind the lazy pathwalk:
>=20
> * during the entire exercise we never drop rcu_read_lock, therefore any
> objects that have all references to them removed before RCU-delayed
> freeing (inodes, dentries, superblocks and vfsmounts are among such)
> that we might find in process won't be freed until after we are done.
>=20
> * invariant we are keeping:
> 	at some point between the beginning of walk and now the pathname
> traversed so far would lead to nd->path, with nd->seq and nd->inode equal=
 to
> ->d_seq and ->d_inode of the dentry in question.

Thanks for all of this!
I think one part that was confusing me is that there is a place where nd->s=
eq
does related to nd->path.dentry.
Usually  the two are updated at the same time.
However updated nd->seq, but *doesn't* update nd->path.dentry.
Instead, the dentry is stored in a separate 'path' variable.
This discrepancy is eventually resolved in a call to path_to_nameidata().

Which is a bit confusing until you know what is happening :-)


>=20
> * path_init() arranges for that to be true in the beginning of the walk.
>=20
> * symlinks aside, walk_component() preserves that.
> 	+ normal name components go through lookup_fast(), where we have
> 	  __d_lookup_rcu() find a child of current nd->path with matching
> 	  name and (atomically) picks ->d_seq of that child, which had the
> 	  name matching our component.  Atomicity is provided by ->d_lock
> 	  on child.  Then we proceed to pick ->d_inode (and verify that

I don't see d_lock being taken  in __d_lookup_rcu.
I think this atomicity is provided by ->d_seq on the child.


> 	  ->d_seq has not changed, thus making sure that ->d_inode value
> 	  at the moment when the name matched had been the same and child
> 	  is still hashed.  Then we verify that parent's ->d_seq has not
> 	  changed, guaranteeing that parent hadn't been moved or unhashed
> 	  from the moment we'd found it until after we'd found its child.
> 	  Assuming nothing's mounted on top of that thing, and there's no
> 	  problem with ->d_revalidate()), that's it - we have new nd->seq,
> 	  nd->path and nd->inode satisfying our invariant for longer
> 	  piece of pathname.
> 	+ "." needs nothing - we just stay where we are
> 	+ ".." is handled by follow_dotdot_rcu(), which in the normal case
> 	  (no mountpoint crossing) picks ->d_parent of where we are,
> 	  fetches its ->d_inode and ->d_seq and verifies that our directory
> 	  still hadn't changed _its_ ->d_seq.  The last part guarantees that
> 	  it hadn't been moved since the time we'd found it, and thus its
> 	  ->d_parent had remained unchanged at least until that verification.
> 	  Therefore, it remained pinned all along, and it ->d_inode had
> 	  remained stable, including the moment when we fetched ->d_seq.
> 	  Which means that the value we had picked *and* its ->d_inode and
> 	  ->d_seq would satisfy the invariant for the longer piece of
> 	  pathname.
> 	+ mountpoint crossing towards leaves is handled in __follow_mount_rcu();
> 	  it is simple (->mnt_root never changes and is always pinned,
> 	  stabilizing its ->d_inode), but we might need to worry about
> 	  automount points *and* need to make sure that we stop right
> 	  there if mount_lock has been bumped.  See commit b37199e6 for
> 	  the details on the last part - basically, making sure that false
> 	  negatives from __lookup_mnt() won't end up with hard error when
> 	  we walk into whatever had been under the mountpoint we'd missed.
> 	+ mountpoint crossing towards root (in follow_dotdot_rcu()) is
> 	  similar, but there we obviously don't care about automounts.
> 	  Looking at it now, it might make sense to recheck mount_lock there
> 	  as well, though - potential danger is to hit the moment when
> 	  mnt_parent and mnt_mountpoint are out of sync, leaving us with
> 	  mismatched vfsmount,dentry pair.  Generally, that will be caught
> 	  when we try to leave lazy mode (legitimize_mnt() will fail) or
> 	  to cross towards leaves, but the next crossing towards root
> 	  might be bogus as well, and we could end up with unwarranted hard
> 	  error.  Should be very hard to hit, but it's easy enough to check
> 	  *and* backport, so it looks like a good idea for -stable.  Linus,
> 	  do you have any objections against adding
>                 if (read_seqretry(&mount_lock, nd->m_seq))
>                         goto failed;
> 	  right after
>                 if (!follow_up_rcu(&nd->path))
>                         break;
> 	  in follow_dotdot_rcu()?  It's a very narrow race with mount --move
> 	  and most of the time it ends up being completely harmless, but
> 	  it's possible to construct a case when we'll get a bogus hard error
> 	  instead of falling back to non-lazy walk...  OTOH, anyone doing
> 	  that will get something inherently timing-dependent as the result,
> 	  lazy mode or not.  I'm in favour of adding that check, but IMO it's
> 	  not a critical problem.
>=20
> * if we find that we can't continue in lazy mode because some verification
> fails, we just fail with -ECHILD.  However, in cases when our current pos=
ition
> might be fine, but the next step can't be done in lazy mode, we attempt to
> fall back to non-lazy without full restart that would be caused by -ECHIL=
D.
> That's what unlazy_walk() is.  Of course, if we reach the end of walk we
> need to leave the lazy mode as well (complete_walk()).  Either can fail,
> and such a failure means restart from scratch in non-lazy mode.  We need
> to grab references on vfsmount and dentry (or two dentries, when we have
> parent and child to deal with).  The interesting part is vfsmount - we ne=
ed
> to make sure we won't interfere with umount(2), having walked in that suc=
ker
> *after* umount(2) has checked that it's not busy.  See legitimize_mnt() f=
or
> details; basically, we have umount(2) mark the "I've verified it's not bu=
sy
> and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
> on RCU delays on that path - if we find one of those, we undo the
> increment of its refcount we'd just done, without having dropped
> rcu_read_lock().  Full-blown mntput() is done only on mismatches that
> had _not_ been marked that way.
>=20
> BTW, something similar to the above probably needs to be turned into cohe=
rent
> text, either in parallel to Documentation/filesystems/path-lookup.txt, or
> as update to it.

That might be fun.... if I find time.


>=20
> The reason why walk_component() in your call chain won't suffice is simpl=
e -
> it will fail this
>                 /*
>                  * This sequence count validates that the parent had no
>                  * changes while we did the lookup of the dentry above.
>                  *
>                  * The memory barrier in read_seqcount_begin of child is
>                  *  enough, we can use __read_seqcount_retry here.
>                  */
>                 if (__read_seqcount_retry(&parent->d_seq, nd->seq))
>                         return -ECHILD;
> in lookup_fast(), just before updating nd->seq to new value.  Basically,
> it has no way to tell if parent has been buggered to hell and back.
>=20
> It's not _that_ hard to prevent - we just need to stop discarding the par=
ent's
> seq number in "need to follow a symlink" case of walk_component().  Will =
take
> some massage, but not too much...

So.....
Where I have:

+                       if (nd->flags & LOOKUP_RCU) {
+                               if (!nd->root.mnt)
+                                       set_root_rcu(nd);
+                               nd->path =3D nd->root;


in the case where the symlink starts '/', I need to set nd->seq

		nd->seq =3D read_seqcount_begin(&nd->path.dentry->d_seq);


but in the case where symlink doesn't start '/', I don't change nd->path,
so nd->seq should still be correct?

Also, just after that (still in follow_link()), we have

		nd->inode =3D nd->path.dentry->d_inode;

*after* the test for (*s =3D=3D '/').
Wouldn't is be OK to have that *inside* the if?  In the relative-symlink ca=
se
it should be unchanged, and in the jump-symlink case nd_jump_link() has alr=
eady
done that (and probably NULL was returns for 's' so this code doesn't execu=
te).

So following on top of my previous patches?

Thanks heaps,
NeilBrown


diff --git a/fs/namei.c b/fs/namei.c
index d13b4315447f..ce6387d5317c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -947,6 +947,8 @@ follow_link(struct path *link, struct nameidata *nd, vo=
id **p)
 				if (!nd->root.mnt)
 					set_root_rcu(nd);
 				nd->path =3D nd->root;
+				nd->seq =3D read_seqcount_begin(
+					&nd->path->dentry->d_seq);
 			} else {
 				if (!nd->root.mnt)
 					set_root(nd);
@@ -954,9 +956,9 @@ follow_link(struct path *link, struct nameidata *nd, vo=
id **p)
 				nd->path =3D nd->root;
 				path_get(&nd->root);
 			}
+			nd->inode =3D nd->path.dentry->d_inode;
 			nd->flags |=3D LOOKUP_JUMPED;
 		}
-		nd->inode =3D nd->path.dentry->d_inode;
 		error =3D link_path_walk(s, nd);
 		if (unlikely(error))
 			put_link(nd, link, inode, *p);

--Sig_/e_LDiU38DlgFNAZsf6dvkgZ
Content-Type: application/pgp-signature
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIVAwUBVTnkNjnsnt1WYoG5AQLDshAAotMwCa1TBd0aEngsFTWRw5fdFU/B+Gc4
N5c3joKhkEl9jjPJVQd7Uq7wbr03nZ80RDbrhFMFw5rN9e0RTqtY8e2OQT/0uiTJ
Llsf/97anPWR0EpRCLWvGHWJrpiwptkgAo3jrWk9zAg44tSB89xNnRrwOV4XhqEp
aqZsUve0WTs1vq7mOyVLnBAkd1tUv0UTRR4P7lMgkhkS2FUfL85t5jB9BgPCBf3U
GAYidYzM6qYaFsi3oTI3q0CI+1dsSKDTuH6e2q6J3quWT+YQsIkPYKn0EW8B8g3/
VyyX9ybJ3w/doIl/p6Z/OpDXskKhlkKbYXYNKtJzKj/L7JmuK69Y7YNupcs76sJD
enevBW1Yc////dgJLj6yukBRP8W+hY1NrYSODrjXWdUb9/Amj8RU4bdX0JqU3BjV
RlNwcW7SSiIEa+tUDzp1sttxdAY59BAb8nIT4I1HFwthbzmf7bYZIgNXScpm3HF0
7gV/6rThrLNPD18d4Xck65Zl1ISgsLEERbcINhL+wo/wNs++kemj0Pzqui6kFt0R
gHIZt6mPgrhE2nfN6brH7GOJRV9jKkf4rsrqohJYIF+EPl6vSprtUiJRylQn6KEM
k70fln6Mw9TSHYzEQz7wY571fMo8wutpLFLreim6XMwUiy9z1yX33wBIdTPMy0aT
tHIcdzhCcOE=
=w94J
-----END PGP SIGNATURE-----

--Sig_/e_LDiU38DlgFNAZsf6dvkgZ--