On Thu, 23 Apr 2015 19:07:55 +0100 Al Viro wrote: > On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote: > > > follow_link calls link_path_walk -> walk_component -> lookup_fast which sets > > nd->seq. Is that not enough? I guess not when nd_jump_link is called. Is > > that what I missed? > > No. Potential nd_jump_link() callers are just refusing to go there in lazy > mode, end of story. That's not the problem; see below. > > > One thing that is clear to me is that I don't really understand all the > > handling of 'seq' numbers, making me unable to comment usefully. > > I'll try to go through the current code and you current patch with that issue > > in mind and make sure I do understand it. Then I'll try to comment. > > OK, here's the basic theory behind the lazy pathwalk: > > * during the entire exercise we never drop rcu_read_lock, therefore any > objects that have all references to them removed before RCU-delayed > freeing (inodes, dentries, superblocks and vfsmounts are among such) > that we might find in process won't be freed until after we are done. > > * invariant we are keeping: > at some point between the beginning of walk and now the pathname > traversed so far would lead to nd->path, with nd->seq and nd->inode equal to > ->d_seq and ->d_inode of the dentry in question. Thanks for all of this! I think one part that was confusing me is that there is a place where nd->seq does related to nd->path.dentry. Usually the two are updated at the same time. However updated nd->seq, but *doesn't* update nd->path.dentry. Instead, the dentry is stored in a separate 'path' variable. This discrepancy is eventually resolved in a call to path_to_nameidata(). Which is a bit confusing until you know what is happening :-) > > * path_init() arranges for that to be true in the beginning of the walk. > > * symlinks aside, walk_component() preserves that. > + normal name components go through lookup_fast(), where we have > __d_lookup_rcu() find a child of current nd->path with matching > name and (atomically) picks ->d_seq of that child, which had the > name matching our component. Atomicity is provided by ->d_lock > on child. Then we proceed to pick ->d_inode (and verify that I don't see d_lock being taken in __d_lookup_rcu. I think this atomicity is provided by ->d_seq on the child. > ->d_seq has not changed, thus making sure that ->d_inode value > at the moment when the name matched had been the same and child > is still hashed. Then we verify that parent's ->d_seq has not > changed, guaranteeing that parent hadn't been moved or unhashed > from the moment we'd found it until after we'd found its child. > Assuming nothing's mounted on top of that thing, and there's no > problem with ->d_revalidate()), that's it - we have new nd->seq, > nd->path and nd->inode satisfying our invariant for longer > piece of pathname. > + "." needs nothing - we just stay where we are > + ".." is handled by follow_dotdot_rcu(), which in the normal case > (no mountpoint crossing) picks ->d_parent of where we are, > fetches its ->d_inode and ->d_seq and verifies that our directory > still hadn't changed _its_ ->d_seq. The last part guarantees that > it hadn't been moved since the time we'd found it, and thus its > ->d_parent had remained unchanged at least until that verification. > Therefore, it remained pinned all along, and it ->d_inode had > remained stable, including the moment when we fetched ->d_seq. > Which means that the value we had picked *and* its ->d_inode and > ->d_seq would satisfy the invariant for the longer piece of > pathname. > + mountpoint crossing towards leaves is handled in __follow_mount_rcu(); > it is simple (->mnt_root never changes and is always pinned, > stabilizing its ->d_inode), but we might need to worry about > automount points *and* need to make sure that we stop right > there if mount_lock has been bumped. See commit b37199e6 for > the details on the last part - basically, making sure that false > negatives from __lookup_mnt() won't end up with hard error when > we walk into whatever had been under the mountpoint we'd missed. > + mountpoint crossing towards root (in follow_dotdot_rcu()) is > similar, but there we obviously don't care about automounts. > Looking at it now, it might make sense to recheck mount_lock there > as well, though - potential danger is to hit the moment when > mnt_parent and mnt_mountpoint are out of sync, leaving us with > mismatched vfsmount,dentry pair. Generally, that will be caught > when we try to leave lazy mode (legitimize_mnt() will fail) or > to cross towards leaves, but the next crossing towards root > might be bogus as well, and we could end up with unwarranted hard > error. Should be very hard to hit, but it's easy enough to check > *and* backport, so it looks like a good idea for -stable. Linus, > do you have any objections against adding > if (read_seqretry(&mount_lock, nd->m_seq)) > goto failed; > right after > if (!follow_up_rcu(&nd->path)) > break; > in follow_dotdot_rcu()? It's a very narrow race with mount --move > and most of the time it ends up being completely harmless, but > it's possible to construct a case when we'll get a bogus hard error > instead of falling back to non-lazy walk... OTOH, anyone doing > that will get something inherently timing-dependent as the result, > lazy mode or not. I'm in favour of adding that check, but IMO it's > not a critical problem. > > * if we find that we can't continue in lazy mode because some verification > fails, we just fail with -ECHILD. However, in cases when our current position > might be fine, but the next step can't be done in lazy mode, we attempt to > fall back to non-lazy without full restart that would be caused by -ECHILD. > That's what unlazy_walk() is. Of course, if we reach the end of walk we > need to leave the lazy mode as well (complete_walk()). Either can fail, > and such a failure means restart from scratch in non-lazy mode. We need > to grab references on vfsmount and dentry (or two dentries, when we have > parent and child to deal with). The interesting part is vfsmount - we need > to make sure we won't interfere with umount(2), having walked in that sucker > *after* umount(2) has checked that it's not busy. See legitimize_mnt() for > details; basically, we have umount(2) mark the "I've verified it's not busy > and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely > on RCU delays on that path - if we find one of those, we undo the > increment of its refcount we'd just done, without having dropped > rcu_read_lock(). Full-blown mntput() is done only on mismatches that > had _not_ been marked that way. > > BTW, something similar to the above probably needs to be turned into coherent > text, either in parallel to Documentation/filesystems/path-lookup.txt, or > as update to it. That might be fun.... if I find time. > > The reason why walk_component() in your call chain won't suffice is simple - > it will fail this > /* > * This sequence count validates that the parent had no > * changes while we did the lookup of the dentry above. > * > * The memory barrier in read_seqcount_begin of child is > * enough, we can use __read_seqcount_retry here. > */ > if (__read_seqcount_retry(&parent->d_seq, nd->seq)) > return -ECHILD; > in lookup_fast(), just before updating nd->seq to new value. Basically, > it has no way to tell if parent has been buggered to hell and back. > > It's not _that_ hard to prevent - we just need to stop discarding the parent's > seq number in "need to follow a symlink" case of walk_component(). Will take > some massage, but not too much... So..... Where I have: + if (nd->flags & LOOKUP_RCU) { + if (!nd->root.mnt) + set_root_rcu(nd); + nd->path = nd->root; in the case where the symlink starts '/', I need to set nd->seq nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); but in the case where symlink doesn't start '/', I don't change nd->path, so nd->seq should still be correct? Also, just after that (still in follow_link()), we have nd->inode = nd->path.dentry->d_inode; *after* the test for (*s == '/'). Wouldn't is be OK to have that *inside* the if? In the relative-symlink case it should be unchanged, and in the jump-symlink case nd_jump_link() has already done that (and probably NULL was returns for 's' so this code doesn't execute). So following on top of my previous patches? Thanks heaps, NeilBrown diff --git a/fs/namei.c b/fs/namei.c index d13b4315447f..ce6387d5317c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -947,6 +947,8 @@ follow_link(struct path *link, struct nameidata *nd, void **p) if (!nd->root.mnt) set_root_rcu(nd); nd->path = nd->root; + nd->seq = read_seqcount_begin( + &nd->path->dentry->d_seq); } else { if (!nd->root.mnt) set_root(nd); @@ -954,9 +956,9 @@ follow_link(struct path *link, struct nameidata *nd, void **p) nd->path = nd->root; path_get(&nd->root); } + nd->inode = nd->path.dentry->d_inode; nd->flags |= LOOKUP_JUMPED; } - nd->inode = nd->path.dentry->d_inode; error = link_path_walk(s, nd); if (unlikely(error)) put_link(nd, link, inode, *p);