All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fox Chen <foxhlchen@gmail.com>
To: Ian Kent <raven@themaw.net>
Cc: Tejun Heo <tj@kernel.org>, Greg KH <gregkh@linuxfoundation.org>,
	akpm@linux-foundation.org, dhowells@redhat.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	miklos@szeredi.hu, ricklind@linux.vnet.ibm.com,
	sfr@canb.auug.org.au, viro@zeniv.linux.org.uk
Subject: Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
Date: Sat, 19 Dec 2020 15:47:19 +0800	[thread overview]
Message-ID: <CAC2o3DLYDRkDwMw1j9kwDfQRXFCdBGgAOR4eEsfjgm_LJJ9DNA@mail.gmail.com> (raw)
In-Reply-To: <ecf41abd583d5d2c775d9d385ea2a0af7b275037.camel@themaw.net>

On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote:
>
> On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote:
> > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net>
> > > > wrote:
> > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > > > a pointer and dynamically allocate it but even that is
> > > > > > > > too
> > > > > > > > costly a size addition to the kernfs node structure as
> > > > > > > > Tejun has said.
> > > > > > >
> > > > > > > I guess the question to ask is, is there really a need to
> > > > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > > > reading/checking functions.
> > > > > > >
> > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > write/set
> > > > > > > operations in (if there's any) places where things like
> > > > > > > setattr_copy() is not already called?
> > > > > > >
> > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > >
> > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > sysfs
> > > > > > namespace is
> > > > > > implemented, so I don't think there's an easy around that.
> > > > > > The
> > > > > > only
> > > > > > thing I
> > > > > > can think of is embedding the lock into attrs and doing xchg
> > > > > > dance
> > > > > > when
> > > > > > attaching it.
> > > > >
> > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > attrs structure, am I correct?
> > > > >
> > > > > Assuming it is then, to keep things simple, use two locks.
> > > > >
> > > > > One global lock for the allocation and an attrs lock for all
> > > > > the
> > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > update.
> > > > >
> > > > > The critical section for the global lock could be reduced and
> > > > > it
> > > > > changed to a spin lock.
> > > > >
> > > > > In __kernfs_iattrs() we would have something like:
> > > > >
> > > > > take the allocation lock
> > > > > do the allocated checks
> > > > >   assign if existing attrs
> > > > >   release the allocation lock
> > > > >   return existing if found
> > > > > othewise
> > > > >   release the allocation lock
> > > > >
> > > > > allocate and initialize attrs
> > > > >
> > > > > take the allocation lock
> > > > > check if someone beat us to it
> > > > >   free and grab exiting attrs
> > > > > otherwise
> > > > >   assign the new attrs
> > > > > release the allocation lock
> > > > > return attrs
> > > > >
> > > > > Add a spinlock to the attrs struct and use it everywhere for
> > > > > field updates.
> > > > >
> > > > > Am I on the right track or can you see problems with this?
> > > > >
> > > > > Ian
> > > > >
> > > >
> > > > umm, we update the inode in kernfs_refresh_inode, right??  So I
> > > > guess
> > > > the problem is how can we protect the inode when
> > > > kernfs_refresh_inode
> > > > is called, not the attrs??
> > >
> > > But the attrs (which is what's copied from) were protected by the
> > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > dealing with the kernfs node attrs too.
> > >
> > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > copies
> > > the node attrs to the inode under the same mutex lock. So, if a
> > > read
> > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > protected,
> > > it needs to be protected in a different way.
> > >
> >
> > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for
> > .setattr but
> >  no lock for .getattr (misdocumented?? sometimes they have as you've
> > found out)?
> > What does it protect against?? Because .permission does a similar
> > thing
> > here -- updating inode attributes, the goal is to provide the same
> > protection level
> > for .permission as for .setattr, am I right???
>
> As far as the documentation goes that's probably my misunderstanding
> of it.
>
> It does happen that the VFS makes assumptions about how call backs
> are meant to be used.
>
> Read like call backs, like .getattr() and .permission() are meant to
> be used, well, like read like functions so the VFS should be ok to
> take locks or not based on the operation context at hand.
>
> So it's not about the locking for these call backs per se, it's about
> the context in which they are called.
>
> For example, in link_path_walk(), at the beginning of the component
> lookup loop (essentially for the containing directory at that point),
> may_lookup() is called which leads to a call to .permission() without
> any inode lock held at that point.
>
> But file opens (possibly following a path walk to resolve a path)
> are different.
>
> For example, do_filp_open() calls path_openat() which leads to a
> call to open_last_lookups(), which leads to a call to .permission()
> along the way. And in this case there are two contexts, an open()
> create or one without create, the former needing the exclusive inode
> lock and the later able to use the shared lock.
>
> So it's about the locking needed for the encompassing operation that
> is being done not about those functions specifically.
>
> TBH the VFS is very complex and Al has a much, much better
> understanding of it than I do so he would need to be the one to answer
> whether it's the file systems responsibility to use these calls in the
> way the VFS expects.
>
> My belief is that if a file system needs to use a call back in a way
> that's in conflict with what the VFS expects it's the file systems'
> responsibility to deal with the side effects.
>

Thanks for clarifying. Ian.

Yeah, it's complex and confusing and it's very hard to spot lock
context by reading VFS code.

I put code like this:
        if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
                if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
                        pr_warn("kernfs iop_permission inode WRITE
lock is held");
                } else if (lockdep_is_held_type(&inode->i_rwsem, 1)) {
                        pr_warn("kernfs iop_permission inode READ lock
is held");
                }
        } else {
                pr_warn("kernfs iop_permission inode lock is NOT held");
        }

in both .permission & .getattr. Then I do some open/read/write/close
to /sys, interestingly, all log outputs suggest they are in WRITE lock
context.

and I put dump_stack() to the lock-is-held if branch, it prints a lot
of following context:

[  481.795445] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.0-dirty #25
[  481.795446] Hardware name: Parallels Software International Inc.
Parallels Virtual Platform/Parallels Virtual Platform, BIOS 15.1.5
(47309) 09/26/2020
[  481.795446] Call Trace:
[  481.795448] dump_stack (lib/dump_stack.c:120)
[  481.795450] kernfs_iop_permission (fs/kernfs/inode.c:295)
[  481.795452] inode_permission.part.0 (fs/namei.c:398 fs/namei.c:463)
[  481.795454] may_open (fs/namei.c:2875)
[  481.795456] path_openat (fs/namei.c:3250 fs/namei.c:3369)
[  481.795458] ? ___sys_sendmsg (net/socket.c:2411)
[  481.795460] ? preempt_count_add (./include/linux/ftrace.h:821
kernel/sched/core.c:4166 kernel/sched/core.c:4163
kernel/sched/core.c:4191)
[  481.795461] ? sock_poll (net/socket.c:1265)
[  481.795463] do_filp_open (fs/namei.c:3396)
[  481.795466] ? __check_object_size (mm/usercopy.c:240
mm/usercopy.c:286 mm/usercopy.c:256)
[  481.795469] ? _raw_spin_unlock
(./arch/x86/include/asm/preempt.h:102
./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  481.795470] do_sys_openat2 (fs/open.c:1168)
[  481.795472] __x64_sys_openat (fs/open.c:1195)
[  481.795473] do_syscall_64 (arch/x86/entry/common.c:46)
[  481.795475] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:127)
[  481.795476] RIP: 0033:0x7f6b31d69c94

Surprisingly, I didn't see the lock holding code along the path.


thanks,
fox

  reply	other threads:[~2020-12-19  7:48 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-17  7:37 [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement Ian Kent
2020-06-17  7:37 ` [PATCH v2 1/6] kernfs: switch kernfs to use an rwsem Ian Kent
2020-06-17  7:37 ` [PATCH v2 2/6] kernfs: move revalidate to be near lookup Ian Kent
2020-06-17  7:37 ` [PATCH v2 3/6] kernfs: improve kernfs path resolution Ian Kent
2020-06-17  7:38 ` [PATCH v2 4/6] kernfs: use revision to identify directory node changes Ian Kent
2020-06-17  7:38 ` [PATCH v2 5/6] kernfs: refactor attr locking Ian Kent
2020-06-17  7:38 ` [PATCH v2 6/6] kernfs: make attr_mutex a local kernfs node lock Ian Kent
2020-06-19 15:38 ` [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement Tejun Heo
2020-06-19 20:41   ` Rick Lindsley
2020-06-19 22:23     ` Tejun Heo
2020-06-20  2:44       ` Rick Lindsley
2020-06-22 17:53         ` Tejun Heo
2020-06-22 21:22           ` Rick Lindsley
2020-06-23 23:13             ` Tejun Heo
2020-06-24  9:04               ` Rick Lindsley
2020-06-24  9:27                 ` Greg Kroah-Hartman
2020-06-24 13:19                 ` Tejun Heo
2020-06-25  8:15               ` Ian Kent
2020-06-25  9:43                 ` Greg Kroah-Hartman
2020-06-26  0:19                   ` Ian Kent
2020-06-21  4:55       ` Ian Kent
2020-06-22 17:48         ` Tejun Heo
2020-06-22 18:03           ` Greg Kroah-Hartman
2020-06-22 21:27             ` Rick Lindsley
2020-06-23  5:21               ` Greg Kroah-Hartman
2020-06-23  5:09             ` Ian Kent
2020-06-23  6:02               ` Greg Kroah-Hartman
2020-06-23  8:01                 ` Ian Kent
2020-06-23  8:29                   ` Ian Kent
2020-06-23 11:49                   ` Greg Kroah-Hartman
2020-06-23  9:33                 ` Rick Lindsley
2020-06-23 11:45                   ` Greg Kroah-Hartman
2020-06-23 22:55                     ` Rick Lindsley
2020-06-23 11:51                   ` Ian Kent
2020-06-21  3:21   ` Ian Kent
2020-12-10 16:44 ` Fox Chen
2020-12-11  2:01   ` [PATCH " Ian Kent
2020-12-11  2:17     ` Ian Kent
2020-12-13  3:46       ` Ian Kent
2020-12-14  6:14         ` Fox Chen
2020-12-14 13:30           ` Ian Kent
2020-12-15  8:33             ` Fox Chen
2020-12-15 12:59               ` Ian Kent
2020-12-17  4:46                 ` Ian Kent
2020-12-17  8:54                   ` Fox Chen
2020-12-17 10:09                     ` Ian Kent
2020-12-17 11:09                       ` Ian Kent
2020-12-17 11:48                         ` Ian Kent
2020-12-17 15:14                           ` Tejun Heo
2020-12-18  7:36                             ` Ian Kent
2020-12-18  8:01                               ` Fox Chen
2020-12-18 11:21                                 ` Ian Kent
2020-12-18 13:20                                   ` Fox Chen
2020-12-19  0:53                                     ` Ian Kent
2020-12-19  7:47                                       ` Fox Chen [this message]
2020-12-22  2:17                                         ` Ian Kent
2020-12-18 14:59                               ` Tejun Heo
2020-12-19  7:08                                 ` Ian Kent
2020-12-19 16:23                                   ` Tejun Heo
2020-12-19 23:52                                     ` Ian Kent
2020-12-20  1:37                                       ` Ian Kent
2020-12-21  9:28                                       ` Fox Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAC2o3DLYDRkDwMw1j9kwDfQRXFCdBGgAOR4eEsfjgm_LJJ9DNA@mail.gmail.com \
    --to=foxhlchen@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=raven@themaw.net \
    --cc=ricklind@linux.vnet.ibm.com \
    --cc=sfr@canb.auug.org.au \
    --cc=tj@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.