linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>,
	"Chen, Tim C" <tim.c.chen@intel.com>,
	Ingo Molnar <mingo@redhat.com>, Davidlohr Bueso <dbueso@suse.de>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Jason Low <jason.low2@hp.com>,
	Michel Lespinasse <walken@google.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Waiman Long <waiman.long@hp.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
Date: Mon, 6 Jun 2016 14:46:44 -0700	[thread overview]
Message-ID: <CA+55aFw6oTECJ-GvYwkz4+RBgBoKiwPXfygOCZzX1V=KUEMG-g@mail.gmail.com> (raw)
In-Reply-To: <20160606211522.GF14480@ZenIV.linux.org.uk>

On Mon, Jun 6, 2016 at 2:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, there's another fun issue on ramfs - dcache_readdir() is doing an
> obscene amount of grabbing/releasing ->d_lock

Yeah, I saw that report, and I'm not entirely surprised. There was no
real reason to worry about it before, since readdir() couldn't really
cause contention with the previous mutual exclusion at the upper
level.

> I think I have a kinda-sorta solution, but it has a problem.  What I want
> to do is
>         * list_move() only once per dcache_readdir()
>         * ->d_lock taken for that and only for that.
>         * list_move() itself surrounded with write_seqcount_{begin,end} on
> some seqcount
>         * traversal to the next real entry done under rcu_read_lock in a
> seqretry loop.
>
> The only problem is where to put that seqcount (unsigned int, really).
> ->i_dir_seq is an obvious candidate, but that'll need careful profiling
> on getdents/lookup mixes...

Let's look at smaller changes first.

In particular, why the f*ck do we take "next->d_lock" at all, much less twice?

Do we really need that? Both of them seem bogus:

 - the first one only protects the "simple_positive()" test.

    That seems stupid. We hold the parent d_lock _and_ we hold the
parent inode lock for reading, how the hell is simple_positive() going
to change? Yeah, yeah, *maybe* we could catch a lookup that just
happens to add the inode field in process, but it's not a race we even
care about. If we see the inode being non-NULL, it will now *stay*
non-NULL, and we already depend on that (that "d_inode(next)" is then
done without the lock held.

 - the second one only protects "list_move()" of the cursor. But since
it's the child list, the "next->d_lock" thing ends up being
irrelevant. It's the parent dentry lock we need to hold, nothing else.
Not the "next" one.

so I don't see the point of half the d_lock games we play.

And the thing about spinlock contention: having *nested* spinlocks be
contented turns contention into an exponential thing. I really suspect
that if we can just remove the nested spinlock, the dentry->d_lock
contention will go down by a huge amount, because then you completely
remove the "wait on lock while holding another lock" thing, which is
what tends to really suck.

So before you do something fancy, look at the simple parts. I think
getting rid of the nested d_lock might be really simple.

Maybe we might want to do some really careful

     inode = READ_ONCE(dentry->d_inode);

because we're reading the thing without locking, and maybe there's
something else subtle going on. But that really looks fairly
low-hanging, and it might make the problem go away in practice.

              Linus

  reply	other threads:[~2016-06-06 21:46 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-06 20:00 performance delta after VFS i_mutex=>i_rwsem conversion Dave Hansen
2016-06-06 20:46 ` Linus Torvalds
2016-06-06 21:13   ` Waiman Long
2016-06-06 21:20     ` Linus Torvalds
2016-06-07  3:22       ` Valdis.Kletnieks
2016-06-07 15:22         ` Waiman Long
2016-06-08  8:58     ` Ingo Molnar
2016-06-09 10:25       ` Ingo Molnar
2016-06-09 18:14         ` Dave Hansen
2016-06-09 20:10           ` Chen, Tim C
2016-06-06 21:15   ` Al Viro
2016-06-06 21:46     ` Linus Torvalds [this message]
2016-06-06 22:07       ` Al Viro
2016-06-06 23:50         ` Linus Torvalds
2016-06-06 23:59           ` Linus Torvalds
2016-06-07  0:29             ` Linus Torvalds
2016-06-07  0:40           ` Al Viro
2016-06-07  0:44             ` Al Viro
2016-06-07  0:58             ` Al Viro
2016-06-07  0:58             ` Linus Torvalds
2016-06-07  1:19               ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+55aFw6oTECJ-GvYwkz4+RBgBoKiwPXfygOCZzX1V=KUEMG-g@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=dbueso@suse.de \
    --cc=jason.low2@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@intel.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=waiman.long@hp.com \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).