All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mateusz Guzik <mjguzik@gmail.com>
To: torvalds@linux-foundation.org
Cc: catalin.marinas@arm.com, jan.glauber@gmail.com,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	mjguzik@gmail.com, mpe@ellerman.id.au, tony.luck@intel.com,
	viro@zeniv.linux.org.uk, will@kernel.org
Subject: [PATCH] lockref: stop doing cpu_relax in the cmpxchg loop
Date: Fri, 13 Jan 2023 19:44:47 +0100	[thread overview]
Message-ID: <20230113184447.1707316-1-mjguzik@gmail.com> (raw)
In-Reply-To: <CAHk-=wjthxgrLEvgZBUwd35e_mk=dCWKMUEURC6YsX5nWom8kQ@mail.gmail.com>

On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.

To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel
and Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.

All results in ops/s. Note there is some variance in re-runs, but
the code is consistently faster when contention is present.

open3 ("Same file open/close"):
proc	stock	no-pause
1	805603	814942	(+%1)
2	1054980	1054781	(-0%)
8	1544802	1822858	(+18%)
24	1191064	2199665	(+84%)
48	851582	1469860	(+72%)
96	609481	1427170	(+134%)

fstat2 ("Same file fstat"):
proc	stock	no-pause
1	3013872	3047636	(+1%)
2	4284687	4400421	(+2%)
8	3257721	5530156	(+69%)
24	2239819	5466127	(+144%)
48	1701072	5256609	(+209%)
96	1269157	6649326	(+423%)

Additionally, a kernel with a private patch to help access() scalability:
access2 ("Same file access"):
proc	stock	patched	patched+nopause
24	2378041	2005501	5370335	(-15% / +125%)

That is, fixing the problems in access itself *reduces* scalability
after the cacheline ping-pong only happens in lockref with the pause
instruction.

Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.

Code at hand has a rather tortured history. First modification showed up
in d472d9d98b463dd7 ("lockref: Relax in cmpxchg loop"), written with
Itanium in mind. Later it got patched up to use an arch-dependent macro
to stop doing it on s390 where it caused a significant regression. Said
macro had undergone revisions and was ultimately eliminated later, going
back to cpu_relax.

While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.

> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
>else with the potentially broken code"

Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
> So, from the arm64 side of the fence, I'm perfectly happy just removing
> the cpu_relax() calls from lockref.

As such, come back full circle in history and whack it altogether.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 lib/lockref.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 45e93ece8ba0..2afe4c5d8919 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -23,7 +23,6 @@
 		}								\
 		if (!--retry)							\
 			break;							\
-		cpu_relax();							\
 	}									\
 } while (0)
 
-- 
2.39.0


WARNING: multiple messages have this Message-ID (diff)
From: Mateusz Guzik <mjguzik@gmail.com>
To: torvalds@linux-foundation.org
Cc: catalin.marinas@arm.com, jan.glauber@gmail.com,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	mjguzik@gmail.com, mpe@ellerman.id.au, tony.luck@intel.com,
	viro@zeniv.linux.org.uk, will@kernel.org
Subject: [PATCH] lockref: stop doing cpu_relax in the cmpxchg loop
Date: Fri, 13 Jan 2023 19:44:47 +0100	[thread overview]
Message-ID: <20230113184447.1707316-1-mjguzik@gmail.com> (raw)
In-Reply-To: <CAHk-=wjthxgrLEvgZBUwd35e_mk=dCWKMUEURC6YsX5nWom8kQ@mail.gmail.com>

On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.

To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel
and Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.

All results in ops/s. Note there is some variance in re-runs, but
the code is consistently faster when contention is present.

open3 ("Same file open/close"):
proc	stock	no-pause
1	805603	814942	(+%1)
2	1054980	1054781	(-0%)
8	1544802	1822858	(+18%)
24	1191064	2199665	(+84%)
48	851582	1469860	(+72%)
96	609481	1427170	(+134%)

fstat2 ("Same file fstat"):
proc	stock	no-pause
1	3013872	3047636	(+1%)
2	4284687	4400421	(+2%)
8	3257721	5530156	(+69%)
24	2239819	5466127	(+144%)
48	1701072	5256609	(+209%)
96	1269157	6649326	(+423%)

Additionally, a kernel with a private patch to help access() scalability:
access2 ("Same file access"):
proc	stock	patched	patched+nopause
24	2378041	2005501	5370335	(-15% / +125%)

That is, fixing the problems in access itself *reduces* scalability
after the cacheline ping-pong only happens in lockref with the pause
instruction.

Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.

Code at hand has a rather tortured history. First modification showed up
in d472d9d98b463dd7 ("lockref: Relax in cmpxchg loop"), written with
Itanium in mind. Later it got patched up to use an arch-dependent macro
to stop doing it on s390 where it caused a significant regression. Said
macro had undergone revisions and was ultimately eliminated later, going
back to cpu_relax.

While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.

> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
>else with the potentially broken code"

Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
> So, from the arm64 side of the fence, I'm perfectly happy just removing
> the cpu_relax() calls from lockref.

As such, come back full circle in history and whack it altogether.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 lib/lockref.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 45e93ece8ba0..2afe4c5d8919 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -23,7 +23,6 @@
 		}								\
 		if (!--retry)							\
 			break;							\
-		cpu_relax();							\
 	}									\
 } while (0)
 
-- 
2.39.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Mateusz Guzik <mjguzik@gmail.com>
To: torvalds@linux-foundation.org
Cc: linux-arch@vger.kernel.org, mjguzik@gmail.com, will@kernel.org,
	catalin.marinas@arm.com, tony.luck@intel.com,
	viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org,
	jan.glauber@gmail.com, linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org
Subject: [PATCH] lockref: stop doing cpu_relax in the cmpxchg loop
Date: Fri, 13 Jan 2023 19:44:47 +0100	[thread overview]
Message-ID: <20230113184447.1707316-1-mjguzik@gmail.com> (raw)
In-Reply-To: <CAHk-=wjthxgrLEvgZBUwd35e_mk=dCWKMUEURC6YsX5nWom8kQ@mail.gmail.com>

On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.

To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel
and Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.

All results in ops/s. Note there is some variance in re-runs, but
the code is consistently faster when contention is present.

open3 ("Same file open/close"):
proc	stock	no-pause
1	805603	814942	(+%1)
2	1054980	1054781	(-0%)
8	1544802	1822858	(+18%)
24	1191064	2199665	(+84%)
48	851582	1469860	(+72%)
96	609481	1427170	(+134%)

fstat2 ("Same file fstat"):
proc	stock	no-pause
1	3013872	3047636	(+1%)
2	4284687	4400421	(+2%)
8	3257721	5530156	(+69%)
24	2239819	5466127	(+144%)
48	1701072	5256609	(+209%)
96	1269157	6649326	(+423%)

Additionally, a kernel with a private patch to help access() scalability:
access2 ("Same file access"):
proc	stock	patched	patched+nopause
24	2378041	2005501	5370335	(-15% / +125%)

That is, fixing the problems in access itself *reduces* scalability
after the cacheline ping-pong only happens in lockref with the pause
instruction.

Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.

Code at hand has a rather tortured history. First modification showed up
in d472d9d98b463dd7 ("lockref: Relax in cmpxchg loop"), written with
Itanium in mind. Later it got patched up to use an arch-dependent macro
to stop doing it on s390 where it caused a significant regression. Said
macro had undergone revisions and was ultimately eliminated later, going
back to cpu_relax.

While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.

> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
>else with the potentially broken code"

Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
> So, from the arm64 side of the fence, I'm perfectly happy just removing
> the cpu_relax() calls from lockref.

As such, come back full circle in history and whack it altogether.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 lib/lockref.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 45e93ece8ba0..2afe4c5d8919 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -23,7 +23,6 @@
 		}								\
 		if (!--retry)							\
 			break;							\
-		cpu_relax();							\
 	}									\
 } while (0)
 
-- 
2.39.0


  parent reply	other threads:[~2023-01-13 18:45 UTC|newest]

Thread overview: 108+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-12 23:36 lockref scalability on x86-64 vs cpu_relax Mateusz Guzik
2023-01-13  0:13 ` Linus Torvalds
2023-01-13  0:13   ` Linus Torvalds
2023-01-13  0:13   ` Linus Torvalds
2023-01-13  0:30   ` Luck, Tony
2023-01-13  0:30     ` Luck, Tony
2023-01-13  0:30     ` Luck, Tony
2023-01-13  0:45     ` Linus Torvalds
2023-01-13  0:45       ` Linus Torvalds
2023-01-13  0:45       ` Linus Torvalds
2023-01-13  7:55     ` ia64 removal (was: Re: lockref scalability on x86-64 vs cpu_relax) Ard Biesheuvel
2023-01-13  7:55       ` Ard Biesheuvel
2023-01-13  7:55       ` Ard Biesheuvel
2023-01-13 16:17       ` Luck, Tony
2023-01-13 16:17         ` Luck, Tony
2023-01-13 16:17         ` Luck, Tony
2023-01-13 20:49       ` Jessica Clarke
2023-01-13 20:49         ` Jessica Clarke
2023-01-13 20:49         ` Jessica Clarke
2023-01-13 21:03         ` Luck, Tony
2023-01-13 21:03           ` Luck, Tony
2023-01-13 21:03           ` Luck, Tony
2023-01-13 21:04           ` Jessica Clarke
2023-01-13 21:04             ` Jessica Clarke
2023-01-13 21:04             ` Jessica Clarke
2023-01-13 21:05       ` John Paul Adrian Glaubitz
2023-01-13 21:05         ` John Paul Adrian Glaubitz
2023-01-13 21:05         ` John Paul Adrian Glaubitz
2023-01-13 23:25         ` Ard Biesheuvel
2023-01-13 23:25           ` Ard Biesheuvel
2023-01-13 23:25           ` Ard Biesheuvel
2023-01-14 11:24           ` Sedat Dilek
2023-01-14 11:24             ` Sedat Dilek
2023-01-14 11:24             ` Sedat Dilek
2023-01-14 11:28             ` Sedat Dilek
2023-01-14 11:28               ` Sedat Dilek
2023-01-14 11:28               ` Sedat Dilek
2023-01-15  0:27               ` Matthew Wilcox
2023-01-15  0:27                 ` Matthew Wilcox
2023-01-15  0:27                 ` Matthew Wilcox
2023-01-15 12:04                 ` Sedat Dilek
2023-01-15 12:04                   ` Sedat Dilek
2023-01-15 12:04                   ` Sedat Dilek
2023-01-16  9:42                   ` John Paul Adrian Glaubitz
2023-01-16  9:42                     ` John Paul Adrian Glaubitz
2023-01-16  9:42                     ` John Paul Adrian Glaubitz
2023-01-16  9:41                 ` John Paul Adrian Glaubitz
2023-01-16  9:41                   ` John Paul Adrian Glaubitz
2023-01-16  9:41                   ` John Paul Adrian Glaubitz
2023-01-16 13:28                   ` Matthew Wilcox
2023-01-16 13:28                     ` Matthew Wilcox
2023-01-16 13:28                     ` Matthew Wilcox
2023-01-16  9:40               ` John Paul Adrian Glaubitz
2023-01-16  9:40                 ` John Paul Adrian Glaubitz
2023-01-16  9:40                 ` John Paul Adrian Glaubitz
2023-01-16  9:37             ` John Paul Adrian Glaubitz
2023-01-16  9:37               ` John Paul Adrian Glaubitz
2023-01-16  9:37               ` John Paul Adrian Glaubitz
2023-01-16  9:32           ` John Paul Adrian Glaubitz
2023-01-16  9:32             ` John Paul Adrian Glaubitz
2023-01-16  9:32             ` John Paul Adrian Glaubitz
2023-01-16 10:09             ` Ard Biesheuvel
2023-01-16 10:09               ` Ard Biesheuvel
2023-01-16 10:09               ` Ard Biesheuvel
2023-01-13  1:12   ` lockref scalability on x86-64 vs cpu_relax Mateusz Guzik
2023-01-13  1:12     ` Mateusz Guzik
2023-01-13  1:12     ` Mateusz Guzik
2023-01-13  4:08     ` Linus Torvalds
2023-01-13  4:08       ` Linus Torvalds
2023-01-13  4:08       ` Linus Torvalds
2023-01-13  9:46     ` Will Deacon
2023-01-13  9:46       ` Will Deacon
2023-01-13  9:46       ` Will Deacon
2023-01-13  3:20   ` Nicholas Piggin
2023-01-13  3:20     ` Nicholas Piggin
2023-01-13  3:20     ` Nicholas Piggin
2023-01-13  4:15     ` Linus Torvalds
2023-01-13  4:15       ` Linus Torvalds
2023-01-13  4:15       ` Linus Torvalds
2023-01-13  5:36       ` Nicholas Piggin
2023-01-13  5:36         ` Nicholas Piggin
2023-01-13  5:36         ` Nicholas Piggin
2023-01-16 14:08     ` Memory transaction instructions David Howells
2023-01-16 14:08       ` David Howells
2023-01-16 15:09       ` Matthew Wilcox
2023-01-16 15:09         ` Matthew Wilcox
2023-01-16 15:09         ` Matthew Wilcox
2023-01-16 16:59       ` Linus Torvalds
2023-01-16 16:59         ` Linus Torvalds
2023-01-16 16:59         ` Linus Torvalds
2023-01-18  9:05       ` David Howells
2023-01-18  9:05         ` David Howells
2023-01-18  9:05         ` David Howells
2023-01-19  1:41         ` Nicholas Piggin
2023-01-19  1:41           ` Nicholas Piggin
2023-01-19  1:41           ` Nicholas Piggin
2023-01-13 10:23   ` lockref scalability on x86-64 vs cpu_relax Peter Zijlstra
2023-01-13 10:23     ` Peter Zijlstra
2023-01-13 10:23     ` Peter Zijlstra
2023-01-13 18:44   ` Mateusz Guzik [this message]
2023-01-13 18:44     ` [PATCH] lockref: stop doing cpu_relax in the cmpxchg loop Mateusz Guzik
2023-01-13 18:44     ` Mateusz Guzik
2023-01-13 21:47     ` Luck, Tony
2023-01-13 21:47       ` Luck, Tony
2023-01-13 21:47       ` Luck, Tony
2023-01-13 23:31       ` Linus Torvalds
2023-01-13 23:31         ` Linus Torvalds
2023-01-13 23:31         ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230113184447.1707316-1-mjguzik@gmail.com \
    --to=mjguzik@gmail.com \
    --cc=catalin.marinas@arm.com \
    --cc=jan.glauber@gmail.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=tony.luck@intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.