From: Matteo Croce <mcroce@linux.microsoft.com>
To: Nick Kossifidis <mick@ics.forth.gr>
Cc: linux-riscv <linux-riscv@lists.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-arch <linux-arch@vger.kernel.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	Albert Ou <aou@eecs.berkeley.edu>,
	Atish Patra <atish.patra@wdc.com>,
	Emil Renner Berthing <kernel@esmil.dk>,
	Akira Tsukamoto <akira.tsukamoto@gmail.com>,
	Drew Fustini <drew@beagleboard.org>,
	Bin Meng <bmeng.cn@gmail.com>,
	David Laight <David.Laight@aculab.com>,
	Guo Ren <guoren@kernel.org>
Subject: Re: [PATCH v3 3/3] riscv: optimized memset
Date: Wed, 23 Jun 2021 02:08:30 +0200	[thread overview]
Message-ID: <CAFnufp2w1TGtaBjfTtsBpDatgAtATRZbB4MURV3tLh1fi-W1JQ@mail.gmail.com> (raw)
In-Reply-To: <17cd289430f08f2b75b7f04242c646f6@mailhost.ics.forth.gr>

On Tue, Jun 22, 2021 at 3:07 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> On 2021-06-17 18:27, Matteo Croce wrote:
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > +     union types dest = { .u8 = s };
> > +
> > +     if (count >= MIN_THRESHOLD) {
> > +             const int bytes_long = BITS_PER_LONG / 8;
>
> You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const
> int mask = bytes_long - 1;' from your memcpy patch visible to memset as
> well (static const...) and use them here (mask would make more sense to
> be named as word_mask).
>

I'll do that.
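Something like this, I guess (just a sketch; 'word_mask' is the name you
suggest, and the values mirror the memcpy patch):

	static const int bytes_long = BITS_PER_LONG / 8;
	static const int word_mask = BITS_PER_LONG / 8 - 1;

declared once at the top of the file so both memcpy and memset see them.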

> > +             unsigned long cu = (unsigned long)c;
> > +
> > +             /* Compose an ulong with 'c' repeated 4/8 times */
> > +             cu |= cu << 8;
> > +             cu |= cu << 16;
> > +#if BITS_PER_LONG == 64
> > +             cu |= cu << 32;
> > +#endif
> > +
>
> You don't have to create cu here, you'll fill dest buffer with 'c'
> anyway so after filling up enough 'c's to be able to grab an aligned
> word full of them from dest, you can just grab that word and keep
> filling up dest with it.
>

I tried that, but that way I have to fill 8 more bytes one at a time
before the word-sized loop can start.
Also, the machine code needed to generate 'cu' is just 6 instructions on riscv:

slli a5,a0,8
or a5,a5,a0
slli a0,a5,16
or a0,a0,a5
slli a5,a0,32
or a0,a5,a0

so probably it's not worth it.
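
For reference, this is roughly how I read your suggestion (an untested
sketch, reusing the union fields and variables from the patch plus your
'word_mask' name):

	unsigned long *first_word = (unsigned long *)
		((dest.uptr + word_mask) & ~(unsigned long)word_mask);

	/* byte-fill until one whole aligned word contains only 'c'... */
	while (count && dest.uptr < (uintptr_t)(first_word + 1)) {
		*dest.u8++ = c;
		count--;
	}
	/* ...then reuse that word as the fill pattern */
	cu = *first_word;

In the worst case that's 15 bytes written one at a time before the word
loop can start, instead of at most 7.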

> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +             /* Fill the buffer one byte at time until the destination
> > +              * is aligned on a 32/64 bit boundary.
> > +              */
> > +             for (; count && dest.uptr % bytes_long; count--)
>
> You could reuse & mask here instead of % bytes_long.
>

Sure, although the generated machine code will be the same.
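i.e. (with bytes_long being a power of two):

	for (; count && dest.uptr % bytes_long; count--)	/* current */
	for (; count && (dest.uptr & word_mask); count--)	/* with the mask */

the unsigned modulo by a power-of-two constant should be folded into the
same andi either way, so it's purely cosmetic.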

> > +                     *dest.u8++ = c;
> > +#endif
>
> I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> and even if it did we are talking about a loop that will run just a few
> times to reach the alignment boundary (worst case scenario it'll run 7
> times), I don't think we gain much here, even for archs that have
> efficient unaligned access.

It doesn't _now_, but maybe in the future we will have a CPU which
handles unaligned accesses efficiently!

-- 
per aspera ad upstream

