Re: lockref scalability on x86-64 vs cpu_relax

From: Will Deacon <will@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: linux-arch <linux-arch@vger.kernel.org>,
	tony.luck@intel.com, Catalin Marinas <catalin.marinas@arm.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org,
	Jan Glauber <jan.glauber@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: lockref scalability on x86-64 vs cpu_relax
Date: Fri, 13 Jan 2023 09:46:25 +0000	[thread overview]
Message-ID: <20230113094625.GA12235@willie-the-truck> (raw)
In-Reply-To: <CAGudoHE0tzL8OAqvwpDR4Nn_g70a8qBdE_+-fmhXF-DEx_K6kg@mail.gmail.com>

On Fri, Jan 13, 2023 at 02:12:50AM +0100, Mateusz Guzik wrote:
> On 1/13/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Side note on your access() changes - if it turns out that you can
> > remove all the cred games, we should possibly then revert my old
> > commit d7852fbd0f04 ("access: avoid the RCU grace period for the
> > temporary subjective credentials") which avoided the biggest issue
> > with the unnecessary cred switching.
> >
> > I *think* access() is the only user of that special 'non_rcu' thing,
> > but it is possible that the whole 'non_rcu' thing ends up mattering
> > for cases where the cred actually does change because euid != uid (ie
> > suid programs), so this would need a bit more effort to do performance
> > testing on.
> >
> 
> I don't think the games are avoidable. For one I found non-root
> processes with non-empty cap_effective even on my laptop, albeit I did
> not check how often something like this is doing access().
> 
> Discussion for another time.
> 
> > On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> All that said, I think the thing to do here is to replace cpu_relax
> >> with a dedicated arch-dependent macro, akin to the following:
> >
> > I would actually prefer just removing it entirely and see if somebody
> > else hollers. You have the numbers to prove it hurts on real hardware,
> > and I don't think we have any numbers to the contrary.
> >
> > So I think it's better to trust the numbers and remove it as a
> > failure, than say "let's just remove it on x86-64 and leave everybody
> > else with the potentially broken code"
> >
> [snip]
> > Then other architectures can try to run their numbers, and only *if*
> > it then turns out that they have a reason to do something else should
> > we make this conditional and different on different architectures.
> >
> > Let's try to keep the code as common as possibly until we have hard
> > evidence for special cases, in other words.
> >
> 
> I did not want to make such a change without redoing the ThunderX2
> benchmark, or at least something else arm64-y. I may be able to bench it
> tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming
> no arm64 people show up with their results.
> 
> Even then IMHO the safest route is to patch it out on x86-64 and give
> other people time to bench their archs as they get around to it, and
> ultimately whack the thing if it turns out nobody benefits from it.
> I would say beats backpedaling on the removal, but I'm not going to
> fight for it.
> 
> That said, does waiting for arm64 numbers and/or producing them for the
> removal commit message sound like a plan? If so, I'll post soon(tm).

Honestly, I wouldn't worry about us (arm64) here. I don't think any real
hardware implements the YIELD instruction (i.e. it behaves as a NOP in
practice). The only place I'm aware of where it _does_ something is in
QEMU, which was actually the motivation behind having it in cpu_relax() to
start with (see 1baa82f48030 ("arm64: Implement cpu_relax as yield")).

So, from the arm64 side of the fence, I'm perfectly happy just removing
the cpu_relax() calls from lockref.

Will