Re: [PATCH] VFS: Cut down inode->i_op->xyz accesses in path walking

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] VFS: Cut down inode->i_op->xyz accesses in path walking
Date: Mon, 25 Jul 2011 20:23:38 -0700	[thread overview]
Message-ID: <CA+55aFwNQLXT=4KPsq6xsv-pjk2a=8+mKAe5Jmd4QSnC7-+a3g@mail.gmail.com> (raw)
In-Reply-To: <20110726030531.GE22133@ZenIV.linux.org.uk>

On Mon, Jul 25, 2011 at 8:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We could actually cheat a bit.  *All* inodes that ever reach inode_permission()
> have at some point been pointed to by ->d_inode of some dentry.  It's
> inconvenient as hell (and inviting abuse) to pass such dentry instead
> of inode; moreover, there would be nasty questions of ->d_inode stability
> on RCU path.
>
> However, we can greatly reduce that amount of changes if we do the
> following:
>        * one bit in your new field set when inode is put in ->d_inode.
> Checked at inode_permission() time, BUG_ON() if not set.
>        * other bit(s) set by checking ->i_op contents at the same time,
> if bit hadn't been already set (->i_op can't change afterwards); checked
> by inode_permission() and whatever else might want to (e.g. checks for
> non-NULL ->lookup(); for those we also know that inode went through
> some ->d_inode at some point).
>        * we need to play with setting these flags only in two places -
> __d_instantiate() and d_obtain_alias().

Ugh. That sounds more complex than I'd really like. Smaller change,
yes. But the end result is just more complicated.

What I *did* consider was to have the bits just be "slow case" bits.
And just set them all (blindly) at inode allocation time. Then, the
users would simply end up doing

  static inline int do_inode_permission(struct inode *inode, int mask)
  {
    if (inode->i_opflag & SLOW_PERMISSION) {
      if (inode->i_op->permission)
        return inode->i_op->permission(inode, mask);
      bit_clear(SLOW_PERMISSION, &inode->i_opflag);
    }
    return generic_permission(inode, mask);
  }

which basically would trigger the *really* slow case once (that atomic
bit clear that happens just once - or maybe a couple of concurrent
times for the harmless race), but would have very trivial complexity,
and you could tailor the bits arbitrarily for what you want to test as
the fast case.

The reason I didn't go any further along that way is that we don't
have any nice atomic operations for small fields like that. The bitops
want a "long" entity, and it becomes just a nightmare to work out bit
positions with big-endian etc. And I definitely didn't want to grow
the inode.

But I think it would be a reasonable approach with less complexity,
and we could even use a spinlock (ie i_lock) for the slow case, since
it really should only ever happen once per inode load.

It's absolutely ok to test the i_op fields when we actually use them,
because by then we'll obviously have the whole "populate the cache and
follow the pointer chain" issue anyway. So we have a clear "fast case"
when we simply don't want to even look at that pointer at all, and the
above kind of code would work fine, I think.

That said, the big patch is clearly the simplest approach of them all.
It does have the "did you set i_op with the proper function" of
course, but that' s at least not a *complex* bug, and if you get the
i_opflag value wrong, it will be mostly be quite obvious (eg a NULL
pointer dereference or simply not calling the filesystem permission
check at all).

But I'm not married to any particular solution. It's just annoying how
visible that i_op access is in profiles, and how close we are to not
needing it ;)

                           Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html