From: Yury Norov <yury.norov@gmail.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Lucas Stach <l.stach@pengutronix.de>,
	Russell King <linux+etnaviv@armlinux.org.uk>,
	Christian Gmeiner <christian.gmeiner@gmail.com>,
	David Airlie <airlied@linux.ie>, Daniel Vetter <daniel@ffwll.ch>,
	Jean Delvare <jdelvare@suse.com>,
	Guenter Roeck <linux@roeck-us.net>,
	Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	David Woodhouse <dwmw@amazon.co.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Geert Uytterhoeven <geert+renesas@glider.be>,
	Alexey Klimov <aklimov@redhat.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	etnaviv@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-hwmon@vger.kernel.org
Subject: Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()
Date: Sun, 27 Jun 2021 09:47:25 -0700
Message-ID: <YNirnaYw1GSxg1jK@yury-ThinkPad>
In-Reply-To: <YM4pJpNphEwvUF2F@yury-ThinkPad>

On Sat, Jun 19, 2021 at 10:28:07AM -0700, Yury Norov wrote:
> On Sat, Jun 19, 2021 at 05:24:15PM +0100, Marc Zyngier wrote:
> > On Fri, 18 Jun 2021 20:57:34 +0100,
> > Yury Norov <yury.norov@gmail.com> wrote:
> > > 
> > > The macros iterate through all set/clear bits in a bitmap. They search
> > > for the first bit using find_first_bit(), and for the rest of the bits
> > > using find_next_bit().
> > > 
> > > Since find_next_bit() is called shortly after find_first_bit(), we can
> > > save a few lines of I-cache by not using find_first_bit().
> > 
> > Really?
> > 
> > > 
> > > Signed-off-by: Yury Norov <yury.norov@gmail.com>
> > > ---
> > >  include/linux/find.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/find.h b/include/linux/find.h
> > > index 4500e8ab93e2..ae9ed52b52b8 100644
> > > --- a/include/linux/find.h
> > > +++ b/include/linux/find.h
> > > @@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> > >  #endif
> > >  
> > >  #define for_each_set_bit(bit, addr, size) \
> > > -	for ((bit) = find_first_bit((addr), (size));		\
> > > +	for ((bit) = find_next_bit((addr), (size), 0);		\
> > 
> > On which architecture do you observe a gain? Only 32bit ARM and m68k
> > implement their own version of find_first_bit(), and everyone else
> > uses the canonical implementation:
> 
> And those who enable GENERIC_FIND_FIRST_BIT - x86, arm64, arc, mips
> and s390.
> 
> > #ifndef find_first_bit
> > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > #endif
> > 
> > These architectures explicitly have different implementations for
> > find_first_bit() and find_next_bit() because they can do better
> > (whether that is true or not is another debate). I don't think you
> > should remove this optimisation until it has been measured on these
> > two architectures.
> 
> This patch is based on a series that enables a separate implementation
> of find_first_bit() for all architectures; according to my tests,
> find_first* is ~2x as fast as find_next* on arm64 and x86.
> 
> https://lore.kernel.org/lkml/20210612123639.329047-1-yury.norov@gmail.com/T/#t
> 
> After applying the series, I noticed that my small kernel module that
> calls for_each_set_bit() now uses find_first_bit() to find just one
> bit, and find_next_bit() for all the others. I think it's better to
> always use find_next_bit() in this case, to minimize the chance of a
> cache miss. But if it's not that obvious, I'll try to write a test.
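
Some background on why a dedicated find_first_bit() can win: find_next_bit()
must mask off the bits below its start offset in the first word it touches,
and the generic helper behind it also carries the extra parameters used by
the _le and find_*_zero_bit variants. A simplified sketch of a dedicated
first-bit search (illustrative only; the function name is made up here, and
this is not the exact kernel code):

	static unsigned long find_first_bit_sketch(const unsigned long *addr,
						   unsigned long size)
	{
		unsigned long idx;

		/* Scan word by word for the first nonzero word... */
		for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
			if (addr[idx])
				/* ...then locate its lowest set bit. */
				return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
		}

		/* No bits set: return size, matching the find_*_bit() API. */
		return size;
	}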

This test measures the difference between for_each_set_bit() and
for_each_set_bit_from().

diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
index 5637c5711db9..1f37e99090b0 100644
--- a/lib/find_bit_benchmark.c
+++ b/lib/find_bit_benchmark.c
@@ -111,6 +111,59 @@ static int __init test_find_next_and_bit(const void *bitmap,
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+#define flush_cache_all() wbinvd()
+#endif
+
+static int __init test_for_each_set_bit(int flags)
+{
+#ifdef flush_cache_all
+	DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+	unsigned long i, cnt = 0;
+	ktime_t time;
+
+	bm[0] = 1; bm[1] = 0;
+
+	time = ktime_get();
+	while (cnt < 1000) {
+		if (flags)
+			flush_cache_all();
+		for_each_set_bit(i, bm, BITS_PER_LONG * 2)
+			cnt++;
+	}
+
+	time = ktime_get() - time;
+
+	pr_err("for_each_set_bit:   %18llu ns, %6ld iterations\n",  time, cnt);
+#endif
+	return 0;
+}
+
+static int __init test_for_each_set_bit_from(int flags)
+{
+#ifdef flush_cache_all
+	DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+	unsigned long i, cnt = 0;
+	ktime_t time;
+
+	bm[0] = 1; bm[1] = 0;
+
+	time = ktime_get();
+	while (cnt < 1000) {
+		if (flags)
+			flush_cache_all();
+		i = 0;
+		for_each_set_bit_from(i, bm, BITS_PER_LONG * 2)
+			cnt++;
+	}
+
+	time = ktime_get() - time;
+
+	pr_err("for_each_set_bit_from:%16llu ns, %6ld iterations\n", time, cnt);
+#endif
+	return 0;
+}
+
 static int __init find_bit_test(void)
 {
 	unsigned long nbits = BITMAP_LEN / SPARSE;
@@ -147,6 +200,16 @@ static int __init find_bit_test(void)
 	test_find_first_bit(bitmap, BITMAP_LEN);
 	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
+	pr_err("\nStart testing for_each_bit()\n");
+
+	test_for_each_set_bit(0);
+	test_for_each_set_bit_from(0);
+
+	pr_err("\nStart testing for_each_bit() with cash flushing\n");
+
+	test_for_each_set_bit(1);
+	test_for_each_set_bit_from(1);
+
 	/*
 	 * Everything is OK. Return error just to let user run benchmark
 	 * again without annoying rmmod.

Here, on each iteration:
 - for_each_set_bit() calls find_first_bit() once and find_next_bit() once;
 - for_each_set_bit_from() calls find_next_bit() twice.
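
For reference, this call pattern follows directly from how the iterators
expand. for_each_set_bit() is quoted in the hunk above; the _from variant
is paraphrased below for comparison:

	#define for_each_set_bit(bit, addr, size) \
		for ((bit) = find_first_bit((addr), (size)); \
		     (bit) < (size); \
		     (bit) = find_next_bit((addr), (size), (bit) + 1))

	#define for_each_set_bit_from(bit, addr, size) \
		for ((bit) = find_next_bit((addr), (size), (bit)); \
		     (bit) < (size); \
		     (bit) = find_next_bit((addr), (size), (bit) + 1))

With a single set bit in bm, each pass of the while loop therefore makes
exactly two calls: one that finds bit 0, and one that runs past the end of
the bitmap and terminates the for loop.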

On my AMD Ryzen 7 4700U, the results look like this:

Start testing for_each_bit()
for_each_set_bit:                15296 ns,   1000 iterations
for_each_set_bit_from:           15225 ns,   1000 iterations

Start testing for_each_bit() with cache flushing
for_each_set_bit:               547626 ns,   1000 iterations
for_each_set_bit_from:          497899 ns,   1000 iterations

for_each_set_bit_from() is ~10% faster than for_each_set_bit() with cold
caches ((547626 - 497899) / 497899 ≈ 10%), and there is no significant
difference when flush_cache_all() is not called.

So, it looks reasonable to switch for_each_set_bit() to use
find_next_bit() only.

Thanks,
Yury
