lockref scalability on x86-64 vs cpu_relax

From: Mateusz Guzik <mjguzik@gmail.com>
To: linux-fsdevel@vger.kernel.org
Cc: viro@zeniv.linux.org.uk, torvalds@linux-foundation.org,
	Jan Glauber <jan.glauber@gmail.com>,
	tony.luck@intel.com
Subject: lockref scalability on x86-64 vs cpu_relax
Date: Fri, 13 Jan 2023 00:36:55 +0100	[thread overview]
Message-ID: <CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com> (raw)

Hello,

[cc is probably woefully incomplete, just grabbed people from lockref
history; should this land on a x86 list instead of vfs?]

I intended to send a patch which fixes cred-related bottleneck in
access(), and while getting the expected win for calls with different
files, I got a *slowdown* when benchmarking against the same file and
according to perf top it was all lockref. I'm going to post it after
this issue is resolved, interested parties can take a peek here:
https://dpaste.com/8SVDF8HJH .

The problem is visible with open3 test ("Same file open/close") from
will-it-scale. I ran the _processes variant against stock + no-pause
kernel on Cascade Lake (2 sockets * 24 cores * 2 threads) running
6.2-rc3.

Results are:
proc    stock   no-pause
1       805603  814942
2       1054980 1054781
8       1544802 1822858
24      1191064 2199665
48      851582  1469860
96      609481  1427170

As you can see degradation already shows up at ~8 workers.

While trying to do my homework regarding history in the area I found
this thread:
https://lkml.iu.edu/hypermail/linux/kernel/1309.0/02330.html

It mentions a stat-based test, which I presume was multithreaded
stat on the same dentry. will-it-scale somehow does not have stat
benches, so I posted a PR to add some:
https://github.com/antonblanchard/will-it-scale/pull/35/files

With fstat2_processes (Same file fstat) I get:
proc    stock   no-pause
1       3013872 3047636
2       4284687 4400421
8       3257721 5530156
24      2239819 5466127
48      1701072 5256609
96      1269157 6649326

To my understanding on said architecture failed cmpxchg still grants you
exclusive access to the cacheline, making immediate retry preferable
when trying to inc/dec unless a certain value is found. By doing pause
instead one not only induces a delay, but also increases likelihood that
the line will have to be grabbed E again. Something to that extent was
even stated in thread and it definitely lines up with results above.

I see pause first shoed up first here:
commit d472d9d98b463dd7a04f2bcdeafe4261686ce6ab
Author: Tony Luck <tony.luck@intel.com>
Date:   Tue Sep 3 14:49:49 2013 -0700

    lockref: Relax in cmpxchg loop

... without numbers attached to it. Given the above linked thread it
looks like the arch this was targeting was itanium, not x86-64, but
the change landed for everyone.

Later it was further augmented with:
commit 893a7d32e8e04ca4d6c882336b26ed660ca0a48d
Author: Jan Glauber <jan.glauber@gmail.com>
Date:   Wed Jun 5 15:48:49 2019 +0200

    lockref: Limit number of cmpxchg loop retries
[snip]
    With the retry limit the performance of an open-close testcase
    improved between 60-70% on ThunderX2.

While the benchmark was specifically on ThunderX2, the change once more
was made for all archs.

I should note in my tests the retry limit was never reached fwiw.

That aside, the open-close testcase mentioned should match open3.

All that said, I think the thing to do here is to replace cpu_relax
with a dedicated arch-dependent macro, akin to the following:

diff --git a/lib/lockref.c b/lib/lockref.c
index 45e93ece8ba0..e057e1630e7c 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -2,6 +2,10 @@
 #include <linux/export.h>
 #include <linux/lockref.h>

+#ifndef arch_cpu_relax_cmpxchg_loop
+#define arch_cpu_relax_cmpxchg_loop() cpu_relax()
+#endif
+
 #if USE_CMPXCHG_LOCKREF

 /*
@@ -23,7 +27,7 @@
                }
         \
                if (!--retry)
         \
                        break;
         \
-               cpu_relax();
         \
+               arch_cpu_relax_cmpxchg_loop();
         \
        }
         \
 } while (0)

Then x86-64 would simply:
+#define        arch_cpu_relax_cmpxchg_loop do { } while (0)

Not an actual patch and I don't care about the name, just illustrating
what I mean.

I have to note there are probably numerous other cmpxchg loops without
the pause/fallback treatment, quick grep reveals one in
refcount_dec_not_one, if the fallback and/or cpu_relax thing is indeed
desirable the other loops should probably get augmented to have it.

Any comments?

If you agree with the idea I'll submit a  proper patch.

-- 
Mateusz Guzik <mjguzik gmail.com>