All of lore.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint
Date: Fri, 24 Apr 2015 14:42:03 +0100	[thread overview]
Message-ID: <20150424134203.GC889@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20150424163534.6eb109eb@notabene.brown>

On Fri, Apr 24, 2015 at 04:35:34PM +1000, NeilBrown wrote:
> > * path_init() arranges for that to be true in the beginning of the walk.
> > 
> > * symlinks aside, walk_component() preserves that.
> > 	+ normal name components go through lookup_fast(), where we have
> > 	  __d_lookup_rcu() find a child of current nd->path with matching
> > 	  name and (atomically) picks ->d_seq of that child, which had the
> > 	  name matching our component.  Atomicity is provided by ->d_lock
> > 	  on child.  Then we proceed to pick ->d_inode (and verify that
> 
> I don't see d_lock being taken  in __d_lookup_rcu.

Do'h...  No, it isn't - sorry about the braino.

> I think this atomicity is provided by ->d_seq on the child.

Yes.  It's fetch ->d_seq, compare names, then in caller fetch ->d_inode and
compare ->d_seq.

> So.....
> Where I have:
> 
> +                       if (nd->flags & LOOKUP_RCU) {
> +                               if (!nd->root.mnt)
> +                                       set_root_rcu(nd);
> +                               nd->path = nd->root;
> 
> 
> in the case where the symlink starts '/', I need to set nd->seq
> 
> 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> 
> 
> but in the case where symlink doesn't start '/', I don't change nd->path,
> so nd->seq should still be correct?

It would be correct, if walk_component() hadn't _already_ flipped it to
match the symlink.  By that point it's too late - we'd already lost the
old value.  This is the area where it hits the fan:
                if (__read_seqcount_retry(&parent->d_seq, nd->seq))
                        return -ECHILD;
OK, parent is still valid
                nd->seq = seq;
... and we lose its seq number, replacing it with that of a child.

                if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) {
                        status = d_revalidate(dentry, nd->flags);
                        if (unlikely(status <= 0)) {
                                if (status != -ECHILD)
                                        need_reval = 0;
                                goto unlazy;
                        }
                }
                path->mnt = mnt;
                path->dentry = dentry;
set path to (nd->path.mnt, dentry of child)
                if (likely(__follow_mount_rcu(nd, path, inode)))
see if we need to (and can) cross the mountpoint here.  That can change
(vfsmount,dentry) pair *and* nd->seq - making it match whatever path->dentry
ends us being.
                        return 0;
if everything's fine, we return to caller, still in RCU mode.
unlazy:
                if (unlazy_walk(nd, dentry))
... and here we attempt to fall back to non-lazy mode, expecting nd->seq to
match dentry.  Win or lose, that's the last point where nd->seq will be looked
at.
                        return -ECHILD;

See the problem?  We have discarded the nd->seq value for parent and we can't
re-pick it without opening a race.

AFAICS, the simplest solution is to have nd->next_seq and set _that_ in
lookup_fast() and __follow_mount_rcu(), using it in unlazy_walk().  And
in callers have it copied into nd->seq after we'd established that it's not
a symlink to be followed.

Handling of absolute symlinks also needs some care - we have no idea whether
nd->root still makes any sense.  It might have become negative by that point.
unlazy_walk() done first would've checked it's still our process' root, but
if we stay in lazy mode, we obviously can't rely on that check.

One possible solution is to do the same kind of check we do in unlazy_walk() - 
        if (!(nd->flags & LOOKUP_ROOT)) {
                spin_lock(&fs->lock);
                if (nd->root.mnt != fs->root.mnt || nd->root.dentry != fs->root.dentry)
			bugger off
		fetch ->d_inode and ->d_seq
                spin_unlock(&fs->lock);
	} else {
		fetch ->d_inode and ->d_seq
	}
Another - to store the ->d_seq of root in nd->seq_root, have set_root_rcu()
set that instead of (or, better, in addition to) returning it to caller,
have path_init() initialize it explicitly in LOOKUP_ROOT case and have this
"reset to root" in following an absolute symlink just do
	set_root_rcu(nd);
	nd->path = nd->root;
	nd->seq = nd->seq_root;
	nd->inode = nd->path.dentry->d_inode;
	check if nd->path.dentry is stale, bugger off if it is

That avoids this spin_lock() on each absolute symlink at the price of extra
32 bits in struct nameidata.  It looks like doing on-demand reallocation
of nd->stack is the right way to go anyway, so the pressure on nameidata size
is going to be weaker and that might be the right way to go...

  reply	other threads:[~2015-04-24 13:42 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-20 18:12 [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint Al Viro
2015-04-20 18:12 ` [PATCH 01/24] lustre: rip the private symlink nesting limit out Al Viro
2015-04-20 19:08   ` Andreas Dilger
2015-04-20 19:22     ` Al Viro
2015-04-20 20:35       ` Al Viro
2015-04-20 18:12 ` [PATCH 02/24] VFS: replace {, total_}link_count in task_struct with pointer to nameidata Al Viro
2015-04-20 18:12 ` [PATCH 03/24] ovl: rearrange ovl_follow_link to it doesn't need to call ->put_link Al Viro
2015-04-20 18:12 ` [PATCH 04/24] VFS: replace nameidata arg to ->put_link with a char* Al Viro
2015-04-20 18:12 ` [PATCH 05/24] SECURITY: remove nameidata arg from inode_follow_link Al Viro
2015-04-20 18:12 ` [PATCH 06/24] VFS: remove nameidata args from ->follow_link Al Viro
2015-04-20 18:12 ` [PATCH 07/24] namei: expand nested_symlink() in its only caller Al Viro
2015-04-20 18:12 ` [PATCH 08/24] namei.c: separate the parts of follow_link() that find the link body Al Viro
2015-04-20 18:12 ` [PATCH 09/24] namei: fold follow_link() into link_path_walk() Al Viro
2015-04-20 18:12 ` [PATCH 10/24] link_path_walk: handle get_link() returning ERR_PTR() immediately Al Viro
2015-04-20 18:12 ` [PATCH 11/24] link_path_walk: don't bother with walk_component() after jumping link Al Viro
2015-04-20 18:12 ` [PATCH 12/24] link_path_walk: turn inner loop into explicit goto Al Viro
2015-04-20 18:12 ` [PATCH 13/24] link_path_walk: massage a bit more Al Viro
2015-04-20 18:12 ` [PATCH 14/24] link_path_walk: get rid of duplication Al Viro
2015-04-20 18:12 ` [PATCH 15/24] link_path_walk: final preparations to killing recursion Al Viro
2015-04-20 18:13 ` [PATCH 16/24] link_path_walk: kill the recursion Al Viro
2015-04-20 21:04   ` Linus Torvalds
2015-04-20 21:32     ` Al Viro
2015-04-20 21:39       ` Linus Torvalds
2015-04-20 21:51         ` Al Viro
2015-04-20 21:41       ` Linus Torvalds
2015-04-20 21:42         ` Linus Torvalds
2015-04-20 21:59           ` Al Viro
2015-04-20 21:52         ` Al Viro
2015-04-20 18:13 ` [PATCH 17/24] link_path_walk: split "return from recursive call" path Al Viro
2015-04-20 18:13 ` [PATCH 18/24] link_path_walk: cleanup - turn goto start; into continue; Al Viro
2015-04-20 18:13 ` [PATCH 19/24] namei: fold may_follow_link() into follow_link() Al Viro
2015-04-20 18:13 ` [PATCH 20/24] namei: introduce nameidata->stack Al Viro
2015-04-20 18:13 ` [PATCH 21/24] namei: regularize use of put_link() and follow_link(), trim arguments Al Viro
2015-04-20 18:13 ` [PATCH 22/24] namei: trim the arguments of get_link() Al Viro
2015-04-20 18:13 ` [PATCH 23/24] new ->follow_link() and ->put_link() calling conventions Al Viro
2015-04-20 18:13 ` [PATCH 24/24] uninline walk_component() Al Viro
2015-04-21 14:49 ` [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint Al Viro
2015-04-21 15:04   ` Christoph Hellwig
2015-04-21 15:12     ` Richard Weinberger
2015-04-21 15:45       ` Al Viro
2015-04-21 16:46         ` Boaz Harrosh
2015-04-21 21:20         ` Al Viro
2015-04-22 18:07           ` Al Viro
2015-04-22 20:12             ` Al Viro
2015-04-22 21:05               ` Al Viro
2015-04-23  7:45                 ` NeilBrown
2015-04-23 18:07                   ` Al Viro
2015-04-24  6:35                     ` NeilBrown
2015-04-24 13:42                       ` Al Viro [this message]
2015-05-04  5:11                         ` Al Viro
2015-05-04  7:30                           ` NeilBrown
2015-04-23  5:01           ` Al Viro
2015-04-21 14:51 ` [PATCH] logfs: fix a pagecache leak for symlinks Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150424134203.GC889@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.