Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

From: Ian Kent <raven@themaw.net>
To: Fox Chen <foxhlchen@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, Greg KH <gregkh@linuxfoundation.org>,
	akpm@linux-foundation.org, dhowells@redhat.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	miklos@szeredi.hu, ricklind@linux.vnet.ibm.com,
	sfr@canb.auug.org.au, viro@zeniv.linux.org.uk
Subject: Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
Date: Tue, 22 Dec 2020 10:17:23 +0800	[thread overview]
Message-ID: <11f9633e2157450021de4886ed37d0b0cd3436c2.camel@themaw.net> (raw)
In-Reply-To: <CAC2o3DLYDRkDwMw1j9kwDfQRXFCdBGgAOR4eEsfjgm_LJJ9DNA@mail.gmail.com>

On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote:
> On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote:
> > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > > What could be done is to make the kernfs node
> > > > > > > > > attr_mutex
> > > > > > > > > a pointer and dynamically allocate it but even that
> > > > > > > > > is
> > > > > > > > > too
> > > > > > > > > costly a size addition to the kernfs node structure
> > > > > > > > > as
> > > > > > > > > Tejun has said.
> > > > > > > > 
> > > > > > > > I guess the question to ask is, is there really a need
> > > > > > > > to
> > > > > > > > call kernfs_refresh_inode() from functions that are
> > > > > > > > usually
> > > > > > > > reading/checking functions.
> > > > > > > > 
> > > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > > write/set
> > > > > > > > operations in (if there's any) places where things like
> > > > > > > > setattr_copy() is not already called?
> > > > > > > > 
> > > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > > > 
> > > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > > sysfs
> > > > > > > namespace is
> > > > > > > implemented, so I don't think there's an easy around
> > > > > > > that.
> > > > > > > The
> > > > > > > only
> > > > > > > thing I
> > > > > > > can think of is embedding the lock into attrs and doing
> > > > > > > xchg
> > > > > > > dance
> > > > > > > when
> > > > > > > attaching it.
> > > > > > 
> > > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > > attrs structure, am I correct?
> > > > > > 
> > > > > > Assuming it is then, to keep things simple, use two locks.
> > > > > > 
> > > > > > One global lock for the allocation and an attrs lock for
> > > > > > all
> > > > > > the
> > > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > > update.
> > > > > > 
> > > > > > The critical section for the global lock could be reduced
> > > > > > and
> > > > > > it
> > > > > > changed to a spin lock.
> > > > > > 
> > > > > > In __kernfs_iattrs() we would have something like:
> > > > > > 
> > > > > > take the allocation lock
> > > > > > do the allocated checks
> > > > > >   assign if existing attrs
> > > > > >   release the allocation lock
> > > > > >   return existing if found
> > > > > > othewise
> > > > > >   release the allocation lock
> > > > > > 
> > > > > > allocate and initialize attrs
> > > > > > 
> > > > > > take the allocation lock
> > > > > > check if someone beat us to it
> > > > > >   free and grab exiting attrs
> > > > > > otherwise
> > > > > >   assign the new attrs
> > > > > > release the allocation lock
> > > > > > return attrs
> > > > > > 
> > > > > > Add a spinlock to the attrs struct and use it everywhere
> > > > > > for
> > > > > > field updates.
> > > > > > 
> > > > > > Am I on the right track or can you see problems with this?
> > > > > > 
> > > > > > Ian
> > > > > > 
> > > > > 
> > > > > umm, we update the inode in kernfs_refresh_inode, right??  So
> > > > > I
> > > > > guess
> > > > > the problem is how can we protect the inode when
> > > > > kernfs_refresh_inode
> > > > > is called, not the attrs??
> > > > 
> > > > But the attrs (which is what's copied from) were protected by
> > > > the
> > > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > > dealing with the kernfs node attrs too.
> > > > 
> > > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > > copies
> > > > the node attrs to the inode under the same mutex lock. So, if a
> > > > read
> > > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > > protected,
> > > > it needs to be protected in a different way.
> > > > 
> > > 
> > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem
> > > for
> > > .setattr but
> > >  no lock for .getattr (misdocumented?? sometimes they have as
> > > you've
> > > found out)?
> > > What does it protect against?? Because .permission does a similar
> > > thing
> > > here -- updating inode attributes, the goal is to provide the
> > > same
> > > protection level
> > > for .permission as for .setattr, am I right???
> > 
> > As far as the documentation goes that's probably my
> > misunderstanding
> > of it.
> > 
> > It does happen that the VFS makes assumptions about how call backs
> > are meant to be used.
> > 
> > Read like call backs, like .getattr() and .permission() are meant
> > to
> > be used, well, like read like functions so the VFS should be ok to
> > take locks or not based on the operation context at hand.
> > 
> > So it's not about the locking for these call backs per se, it's
> > about
> > the context in which they are called.
> > 
> > For example, in link_path_walk(), at the beginning of the component
> > lookup loop (essentially for the containing directory at that
> > point),
> > may_lookup() is called which leads to a call to .permission()
> > without
> > any inode lock held at that point.
> > 
> > But file opens (possibly following a path walk to resolve a path)
> > are different.
> > 
> > For example, do_filp_open() calls path_openat() which leads to a
> > call to open_last_lookups(), which leads to a call to .permission()
> > along the way. And in this case there are two contexts, an open()
> > create or one without create, the former needing the exclusive
> > inode
> > lock and the later able to use the shared lock.
> > 
> > So it's about the locking needed for the encompassing operation
> > that
> > is being done not about those functions specifically.
> > 
> > TBH the VFS is very complex and Al has a much, much better
> > understanding of it than I do so he would need to be the one to
> > answer
> > whether it's the file systems responsibility to use these calls in
> > the
> > way the VFS expects.
> > 
> > My belief is that if a file system needs to use a call back in a
> > way
> > that's in conflict with what the VFS expects it's the file systems'
> > responsibility to deal with the side effects.
> > 
> 
> Thanks for clarifying. Ian.
> 
> Yeah, it's complex and confusing and it's very hard to spot lock
> context by reading VFS code.
> 
> I put code like this:
>         if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
>                 if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
>                         pr_warn("kernfs iop_permission inode WRITE
> lock is held");
>                 } else if (lockdep_is_held_type(&inode->i_rwsem, 1))
> {
>                         pr_warn("kernfs iop_permission inode READ
> lock
> is held");
>                 }
>         } else {
>                 pr_warn("kernfs iop_permission inode lock is NOT
> held");
>         }
> 
> in both .permission & .getattr. Then I do some open/read/write/close
> to /sys, interestingly, all log outputs suggest they are in WRITE
> lock
> context.

The thing is in open_last_lookups() called from path_openat() we
have:
	if (open_flag & O_CREAT)
        	        inode_lock(dir->d_inode);
	        else
        	        inode_lock_shared(dir->d_inode);

and from there it's - lookup_open() and possibly may_o_create() which
calls inode_permission() and onto .permission().

So, strictly speaking, it should be taken exclusive because you would
expect may_o_create() to be called only on O_CREATE.

But all the possible cases would need to be checked and if it is taken
shared anywhere we are in trouble.

Another example is in link_path_walk() we have a call to may_lookup()
which leads to a call to .permission() without the lock being held.

So there are a bunch of cases to check and knowing you are the owner
of the lock if it is held is at least risky when concurrency is high
and there's a possibility the lock can be taken shared.

Ian