From: Ian Kent <raven@themaw.net>
To: Fox Chen <foxhlchen@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, Greg KH <gregkh@linuxfoundation.org>,
akpm@linux-foundation.org, dhowells@redhat.com,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
miklos@szeredi.hu, ricklind@linux.vnet.ibm.com,
sfr@canb.auug.org.au, viro@zeniv.linux.org.uk
Subject: Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
Date: Tue, 22 Dec 2020 10:17:23 +0800 [thread overview]
Message-ID: <11f9633e2157450021de4886ed37d0b0cd3436c2.camel@themaw.net> (raw)
In-Reply-To: <CAC2o3DLYDRkDwMw1j9kwDfQRXFCdBGgAOR4eEsfjgm_LJJ9DNA@mail.gmail.com>
On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote:
> On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote:
> > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net>
> > > wrote:
> > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net>
> > > > > wrote:
> > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > > What could be done is to make the kernfs node
> > > > > > > > > attr_mutex
> > > > > > > > > a pointer and dynamically allocate it but even that
> > > > > > > > > is
> > > > > > > > > too
> > > > > > > > > costly a size addition to the kernfs node structure
> > > > > > > > > as
> > > > > > > > > Tejun has said.
> > > > > > > >
> > > > > > > > I guess the question to ask is, is there really a need
> > > > > > > > to
> > > > > > > > call kernfs_refresh_inode() from functions that are
> > > > > > > > usually
> > > > > > > > reading/checking functions.
> > > > > > > >
> > > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > > write/set
> > > > > > > > operations in (if there's any) places where things like
> > > > > > > > setattr_copy() is not already called?
> > > > > > > >
> > > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > > >
> > > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > > sysfs
> > > > > > > namespace is
> > > > > > > implemented, so I don't think there's an easy around
> > > > > > > that.
> > > > > > > The
> > > > > > > only
> > > > > > > thing I
> > > > > > > can think of is embedding the lock into attrs and doing
> > > > > > > xchg
> > > > > > > dance
> > > > > > > when
> > > > > > > attaching it.
> > > > > >
> > > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > > attrs structure, am I correct?
> > > > > >
> > > > > > Assuming it is then, to keep things simple, use two locks.
> > > > > >
> > > > > > One global lock for the allocation and an attrs lock for
> > > > > > all
> > > > > > the
> > > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > > update.
> > > > > >
> > > > > > The critical section for the global lock could be reduced
> > > > > > and
> > > > > > it
> > > > > > changed to a spin lock.
> > > > > >
> > > > > > In __kernfs_iattrs() we would have something like:
> > > > > >
> > > > > > take the allocation lock
> > > > > > do the allocated checks
> > > > > > assign if existing attrs
> > > > > > release the allocation lock
> > > > > > return existing if found
> > > > > > othewise
> > > > > > release the allocation lock
> > > > > >
> > > > > > allocate and initialize attrs
> > > > > >
> > > > > > take the allocation lock
> > > > > > check if someone beat us to it
> > > > > > free and grab exiting attrs
> > > > > > otherwise
> > > > > > assign the new attrs
> > > > > > release the allocation lock
> > > > > > return attrs
> > > > > >
> > > > > > Add a spinlock to the attrs struct and use it everywhere
> > > > > > for
> > > > > > field updates.
> > > > > >
> > > > > > Am I on the right track or can you see problems with this?
> > > > > >
> > > > > > Ian
> > > > > >
> > > > >
> > > > > umm, we update the inode in kernfs_refresh_inode, right?? So
> > > > > I
> > > > > guess
> > > > > the problem is how can we protect the inode when
> > > > > kernfs_refresh_inode
> > > > > is called, not the attrs??
> > > >
> > > > But the attrs (which is what's copied from) were protected by
> > > > the
> > > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > > dealing with the kernfs node attrs too.
> > > >
> > > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > > copies
> > > > the node attrs to the inode under the same mutex lock. So, if a
> > > > read
> > > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > > protected,
> > > > it needs to be protected in a different way.
> > > >
> > >
> > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem
> > > for
> > > .setattr but
> > > no lock for .getattr (misdocumented?? sometimes they have as
> > > you've
> > > found out)?
> > > What does it protect against?? Because .permission does a similar
> > > thing
> > > here -- updating inode attributes, the goal is to provide the
> > > same
> > > protection level
> > > for .permission as for .setattr, am I right???
> >
> > As far as the documentation goes that's probably my
> > misunderstanding
> > of it.
> >
> > It does happen that the VFS makes assumptions about how call backs
> > are meant to be used.
> >
> > Read like call backs, like .getattr() and .permission() are meant
> > to
> > be used, well, like read like functions so the VFS should be ok to
> > take locks or not based on the operation context at hand.
> >
> > So it's not about the locking for these call backs per se, it's
> > about
> > the context in which they are called.
> >
> > For example, in link_path_walk(), at the beginning of the component
> > lookup loop (essentially for the containing directory at that
> > point),
> > may_lookup() is called which leads to a call to .permission()
> > without
> > any inode lock held at that point.
> >
> > But file opens (possibly following a path walk to resolve a path)
> > are different.
> >
> > For example, do_filp_open() calls path_openat() which leads to a
> > call to open_last_lookups(), which leads to a call to .permission()
> > along the way. And in this case there are two contexts, an open()
> > create or one without create, the former needing the exclusive
> > inode
> > lock and the later able to use the shared lock.
> >
> > So it's about the locking needed for the encompassing operation
> > that
> > is being done not about those functions specifically.
> >
> > TBH the VFS is very complex and Al has a much, much better
> > understanding of it than I do so he would need to be the one to
> > answer
> > whether it's the file systems responsibility to use these calls in
> > the
> > way the VFS expects.
> >
> > My belief is that if a file system needs to use a call back in a
> > way
> > that's in conflict with what the VFS expects it's the file systems'
> > responsibility to deal with the side effects.
> >
>
> Thanks for clarifying. Ian.
>
> Yeah, it's complex and confusing and it's very hard to spot lock
> context by reading VFS code.
>
> I put code like this:
> if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
> if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
> pr_warn("kernfs iop_permission inode WRITE
> lock is held");
> } else if (lockdep_is_held_type(&inode->i_rwsem, 1))
> {
> pr_warn("kernfs iop_permission inode READ
> lock
> is held");
> }
> } else {
> pr_warn("kernfs iop_permission inode lock is NOT
> held");
> }
>
> in both .permission & .getattr. Then I do some open/read/write/close
> to /sys, interestingly, all log outputs suggest they are in WRITE
> lock
> context.
The thing is in open_last_lookups() called from path_openat() we
have:
if (open_flag & O_CREAT)
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
and from there it's - lookup_open() and possibly may_o_create() which
calls inode_permission() and onto .permission().
So, strictly speaking, it should be taken exclusive because you would
expect may_o_create() to be called only on O_CREATE.
But all the possible cases would need to be checked and if it is taken
shared anywhere we are in trouble.
Another example is in link_path_walk() we have a call to may_lookup()
which leads to a call to .permission() without the lock being held.
So there are a bunch of cases to check and knowing you are the owner
of the lock if it is held is at least risky when concurrency is high
and there's a possibility the lock can be taken shared.
Ian
next prev parent reply other threads:[~2020-12-22 2:18 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-17 7:37 [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement Ian Kent
2020-06-17 7:37 ` [PATCH v2 1/6] kernfs: switch kernfs to use an rwsem Ian Kent
2020-06-17 7:37 ` [PATCH v2 2/6] kernfs: move revalidate to be near lookup Ian Kent
2020-06-17 7:37 ` [PATCH v2 3/6] kernfs: improve kernfs path resolution Ian Kent
2020-06-17 7:38 ` [PATCH v2 4/6] kernfs: use revision to identify directory node changes Ian Kent
2020-06-17 7:38 ` [PATCH v2 5/6] kernfs: refactor attr locking Ian Kent
2020-06-17 7:38 ` [PATCH v2 6/6] kernfs: make attr_mutex a local kernfs node lock Ian Kent
2020-06-19 15:38 ` [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement Tejun Heo
2020-06-19 20:41 ` Rick Lindsley
2020-06-19 22:23 ` Tejun Heo
2020-06-20 2:44 ` Rick Lindsley
2020-06-22 17:53 ` Tejun Heo
2020-06-22 21:22 ` Rick Lindsley
2020-06-23 23:13 ` Tejun Heo
2020-06-24 9:04 ` Rick Lindsley
2020-06-24 9:27 ` Greg Kroah-Hartman
2020-06-24 13:19 ` Tejun Heo
2020-06-25 8:15 ` Ian Kent
2020-06-25 9:43 ` Greg Kroah-Hartman
2020-06-26 0:19 ` Ian Kent
2020-06-21 4:55 ` Ian Kent
2020-06-22 17:48 ` Tejun Heo
2020-06-22 18:03 ` Greg Kroah-Hartman
2020-06-22 21:27 ` Rick Lindsley
2020-06-23 5:21 ` Greg Kroah-Hartman
2020-06-23 5:09 ` Ian Kent
2020-06-23 6:02 ` Greg Kroah-Hartman
2020-06-23 8:01 ` Ian Kent
2020-06-23 8:29 ` Ian Kent
2020-06-23 11:49 ` Greg Kroah-Hartman
2020-06-23 9:33 ` Rick Lindsley
2020-06-23 11:45 ` Greg Kroah-Hartman
2020-06-23 22:55 ` Rick Lindsley
2020-06-23 11:51 ` Ian Kent
2020-06-21 3:21 ` Ian Kent
2020-12-10 16:44 ` Fox Chen
2020-12-11 2:01 ` [PATCH " Ian Kent
2020-12-11 2:17 ` Ian Kent
2020-12-13 3:46 ` Ian Kent
2020-12-14 6:14 ` Fox Chen
2020-12-14 13:30 ` Ian Kent
2020-12-15 8:33 ` Fox Chen
2020-12-15 12:59 ` Ian Kent
2020-12-17 4:46 ` Ian Kent
2020-12-17 8:54 ` Fox Chen
2020-12-17 10:09 ` Ian Kent
2020-12-17 11:09 ` Ian Kent
2020-12-17 11:48 ` Ian Kent
2020-12-17 15:14 ` Tejun Heo
2020-12-18 7:36 ` Ian Kent
2020-12-18 8:01 ` Fox Chen
2020-12-18 11:21 ` Ian Kent
2020-12-18 13:20 ` Fox Chen
2020-12-19 0:53 ` Ian Kent
2020-12-19 7:47 ` Fox Chen
2020-12-22 2:17 ` Ian Kent [this message]
2020-12-18 14:59 ` Tejun Heo
2020-12-19 7:08 ` Ian Kent
2020-12-19 16:23 ` Tejun Heo
2020-12-19 23:52 ` Ian Kent
2020-12-20 1:37 ` Ian Kent
2020-12-21 9:28 ` Fox Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=11f9633e2157450021de4886ed37d0b0cd3436c2.camel@themaw.net \
--to=raven@themaw.net \
--cc=akpm@linux-foundation.org \
--cc=dhowells@redhat.com \
--cc=foxhlchen@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=ricklind@linux.vnet.ibm.com \
--cc=sfr@canb.auug.org.au \
--cc=tj@kernel.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).