* [PATCH] lib: memcmp optimization
From: Jack Wang @ 2018-10-09 14:28 UTC
  To: gregkh, akpm, florian-ewald.mueller, linux-kernel; +Cc: Jack Wang

From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>

During testing, I configured 128 md/raid1 arrays and, while they were
under heavy IO, started a check on each of them
(echo check > /sys/block/mdX/md/sync_action).

The CPU utilization went through the ceiling, and when looking for the
cause with 'perf top' I discovered that ~50% of the time was spent in
memcmp() called from process_checks().

With this patch applied, it drops to 4% - 10%.

Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
[jwang: reformat the commit message]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 lib/string.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/lib/string.c b/lib/string.c
index 2c0900a5d51a..932ef9af2baa 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -852,7 +852,7 @@ EXPORT_SYMBOL(memmove);
  * @count: The size of the area.
  */
 #undef memcmp
-__visible int memcmp(const void *cs, const void *ct, size_t count)
+static inline int __memcmp(const void *cs, const void *ct, size_t count)
 {
 	const unsigned char *su1, *su2;
 	int res = 0;
@@ -862,6 +862,20 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
 			break;
 	return res;
 }
+__visible int memcmp(const void *cs, const void *ct, size_t count)
+{
+	const uint64_t *l1p = cs;
+	const uint64_t *l2p = ct;
+
+	while (count >= sizeof(*l1p)) {
+		if (*l1p != *l2p)
+			return __memcmp(l1p, l2p, sizeof(*l1p));
+		count -= sizeof(*l1p);
+		++l1p;
+		++l2p;
+	}
+	return __memcmp(l1p, l2p, count);
+}
 EXPORT_SYMBOL(memcmp);
 #endif
 
-- 
2.7.4



* Re: [PATCH] lib: memcmp optimization
From: Andrew Morton @ 2018-10-09 23:13 UTC
  To: Jack Wang; +Cc: gregkh, florian-ewald.mueller, linux-kernel, Jack Wang

On Tue,  9 Oct 2018 16:28:11 +0200 Jack Wang <xjtuwjp@gmail.com> wrote:

> From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
> 
> During testing, I configured 128 md/raid1 arrays and, while they were
> under heavy IO, started a check on each of them
> (echo check > /sys/block/mdX/md/sync_action).
> 
> The CPU utilization went through the ceiling, and when looking for the
> cause with 'perf top' I discovered that ~50% of the time was spent in
> memcmp() called from process_checks().
> 
> With this patch applied, it drops to 4% - 10%.

Which CPU architecture?  Most important architectures appear to define
__HAVE_ARCH_MEMCMP.
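
For context, the generic byte-at-a-time memcmp() in lib/string.c is only
built when the architecture does not define __HAVE_ARCH_MEMCMP in its
asm/string.h; a simplified sketch of that guard (paraphrased from
lib/string.c, not a verbatim quote of the upstream file):

/* Architectures that define __HAVE_ARCH_MEMCMP supply their own
 * (usually assembly) memcmp; only then is this generic byte loop
 * compiled out.
 */
#ifndef __HAVE_ARCH_MEMCMP
__visible int memcmp(const void *cs, const void *ct, size_t count)
{
	const unsigned char *su1, *su2;
	int res = 0;

	for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
		if ((res = *su1 - *su2) != 0)
			break;
	return res;
}
EXPORT_SYMBOL(memcmp);
#endif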

> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -852,7 +852,7 @@ EXPORT_SYMBOL(memmove);
>   * @count: The size of the area.
>   */
>  #undef memcmp
> -__visible int memcmp(const void *cs, const void *ct, size_t count)
> +static inline int __memcmp(const void *cs, const void *ct, size_t count)

What the heck does __visible do?

>  {
>  	const unsigned char *su1, *su2;
>  	int res = 0;
> @@ -862,6 +862,20 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
>  			break;
>  	return res;
>  }
> +__visible int memcmp(const void *cs, const void *ct, size_t count)
> +{
> +	const uint64_t *l1p = cs;
> +	const uint64_t *l2p = ct;
> +
> +	while (count >= sizeof(*l1p)) {
> +		if (*l1p != *l2p)
> +			return __memcmp(l1p, l2p, sizeof(*l1p));
> +		count -= sizeof(*l1p);
> +		++l1p;
> +		++l2p;
> +	}
> +	return __memcmp(l1p, l2p, count);
> +}

This is going to do bad things if the incoming addresses aren't
suitably aligned.

Certainly, byte-at-a-time is a pretty lame implementation when the
addresses are suitably aligned.  A fallback to the lame version when
there is misalignment will be simple to do.  And presumably there will
be decent benefits to whoever is actually using this code.  But I'm
wondering who is actually using this code!
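
A minimal, untested sketch of such a fallback (illustrative only, not
part of the submitted patch; it reuses the patch's __memcmp() helper and
the existing IS_ALIGNED() macro from include/linux/kernel.h):

__visible int memcmp(const void *cs, const void *ct, size_t count)
{
	const uint64_t *l1p = cs;
	const uint64_t *l2p = ct;

	/* The word-at-a-time loop is only safe when both pointers are
	 * naturally aligned for uint64_t; otherwise fall back to the
	 * byte-wise __memcmp() for the whole buffer.
	 */
	if (!IS_ALIGNED((unsigned long)cs, sizeof(*l1p)) ||
	    !IS_ALIGNED((unsigned long)ct, sizeof(*l2p)))
		return __memcmp(cs, ct, count);

	while (count >= sizeof(*l1p)) {
		if (*l1p != *l2p)
			return __memcmp(l1p, l2p, sizeof(*l1p));
		count -= sizeof(*l1p);
		++l1p;
		++l2p;
	}
	return __memcmp(l1p, l2p, count);
}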



* Re: [PATCH] lib: memcmp optimization
From: Jinpu Wang @ 2018-10-10  7:33 UTC
  To: Andrew Morton
  Cc: Wang Jinpu, Greg Kroah-Hartman, Florian-Ewald Müller,
	linux-kernel, jannh, x86

On Wed, Oct 10, 2018 at 1:13 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  9 Oct 2018 16:28:11 +0200 Jack Wang <xjtuwjp@gmail.com> wrote:
>
> > From: Florian-Ewald Mueller <florian-ewald.mueller@profitbricks.com>
> >
> > During testing, I configured 128 md/raid1 arrays and, while they were
> > under heavy IO, started a check on each of them
> > (echo check > /sys/block/mdX/md/sync_action).
> >
> > The CPU utilization went through the ceiling, and when looking for the
> > cause with 'perf top' I discovered that ~50% of the time was spent in
> > memcmp() called from process_checks().
> >
> > With this patch applied, it drops to 4% - 10%.
>
> Which CPU architecture?  Most important architectures appear to define
> __HAVE_ARCH_MEMCMP.
x86_64 in our case; I should have documented it more clearly.
>
> > --- a/lib/string.c
> > +++ b/lib/string.c
> > @@ -852,7 +852,7 @@ EXPORT_SYMBOL(memmove);
> >   * @count: The size of the area.
> >   */
> >  #undef memcmp
> > -__visible int memcmp(const void *cs, const void *ct, size_t count)
> > +static inline int __memcmp(const void *cs, const void *ct, size_t count)
>
> What the heck does __visible do?
>
> >  {
> >       const unsigned char *su1, *su2;
> >       int res = 0;
> > @@ -862,6 +862,20 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
> >                       break;
> >       return res;
> >  }
> > +__visible int memcmp(const void *cs, const void *ct, size_t count)
> > +{
> > +     const uint64_t *l1p = cs;
> > +     const uint64_t *l2p = ct;
> > +
> > +     while (count >= sizeof(*l1p)) {
> > +             if (*l1p != *l2p)
> > +                     return __memcmp(l1p, l2p, sizeof(*l1p));
> > +             count -= sizeof(*l1p);
> > +             ++l1p;
> > +             ++l2p;
> > +     }
> > +     return __memcmp(l1p, l2p, count);
> > +}
>
> This is going to do bad things if the incoming addresses aren't
> suitably aligned.
>
Right, as you said, this is easy to fix.
> Certainly, byte-at-a-time is a pretty lame implementation when the
> addresses are suitably aligned.  A fallback to the lame version when
> there is misalignment will be simple to do.  And presumably there will
> be decent benefits to whoever is actually using this code.  But I'm
> wondering who is actually using this code!
>
I found a recent discussion about why x86-64 is using the generic string
functions here:
https://lkml.org/lkml/2018/10/3/818

+cc John, x86


-- 
Jack Wang
Linux Kernel Developer


