From: Peter Kjellerstedt <peter.kjellerstedt@axis.com>
To: "'Timothy Miller'" <miller@techsource.com>,
Daniel Forrest <forrest@lmcg.wisc.edu>
Cc: "'Willy Tarreau'" <willy@w.ods.org>,
linux-kernel mailing list <linux-kernel@vger.kernel.org>
Subject: RE: generic strncpy - off-by-one error
Date: Wed, 20 Aug 2003 09:43:10 +0200 [thread overview]
Message-ID: <D069C7355C6E314B85CF36761C40F9A42E20D1@mailse02.se.axis.com> (raw)
[ I combined all my answers to your four mails into this one. ]
> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com]
> Sent: Monday, August 18, 2003 18:07
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> Peter Kjellerstedt wrote:
> > For loops 2.867568 5.620561 8.128734 28.286289
> > Multi byte fill 2.868031 5.670782 6.312027 11.336015
> >
> > And here are the numbers for my P4:
> >
> > For loops 3.060262 5.927378 8.796814 30.659818
> > Multi byte fill 3.126607 5.898459 7.096685 13.135379
> >
> > So there is no doubt that the multi byte version is a clear
> > winner (which was expected, I suppose).
>
> Cool! Hey, is this just an exercise, or are we actually going to
> use this? I would be very happy to have something I contributed
> to put into the kernel. :)
I have no idea. But it sure seems like the generic strncpy() could
use some improvement (and probably strcpy() too). However, I guess
one can argue that it is better to keep the generic implementations
as simple as possible, and then have each architecture implement
their own optimized versions.
> > Here is the code that I used:
> >
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > while (count && *src) {
> > *tmp++ = *src++;
> > count--;
> > }
> >
> > if (count) {
>
> Good idea... bad to do so many checks if count is zero. On the
> other hand, if count is rarely zero, then it's a loss. Maybe
> benchmark with and without?
Actually, I think this change would be lost in the noise in
the measurements.
> > size_t count2;
> >
> > while (count & (sizeof(long) - 1)) {
> > *tmp++ = '\0';
> > count--;
> > }
> >
> > count2 = count / sizeof(long);
>
> I know that a good compiler should migrate code to help the CPU
> pipeline, but how about moving this "count2 = " line up to before
> the first fill loop. See if that helps any. Always good to
> precompute well in advance.
Cannot do that as the first loop modifies count.
> > while (count2) {
> > *((long *)tmp)++ = '\0';
> > count2--;
> > }
> >
> > count &= (sizeof(long) - 1);
>
> And move this to before the middle fill loop.
In my later implementations this is not possible any longer.
> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> > }
> >
> > return dest;
> > }
> Daniel Forrest wrote:
> > On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
>
> > Shouldn't this be:
> >
> > while (tmp & (sizeof(long) - 1)) {
> >
> >
> >> *tmp++ = '\0';
> >> count--;
> >> }
>
> Oh, yeah! That's right. We need to check the address. Also need to
> cast tmp to (int) or something (doesn't matter what it's cast to,
> because we only care about the lower 2 or 3 bits).
>
> Peter, please see if this makes any speed difference. But it
> definately needs this fix.
Yes, I added it in my later versions.
> Frankly, I'm surprised it works. In fact, it might not, but
> it's hard to tell from the tests just benchmarks.
I actually added verification to my benchmarking program too,
so the later versions I mailed are verified to work the same
as the standard glibc implementation at least.
> Also, if you're doing dense addressing on Alpha, and you do byte
> accesses the addresses for char are byte addresses, but the code
> does read-modify-write to memory for byte accesses, because in
> that mode, you can only do 32-bit and 64-bit accesses. The
> performance increase could be even greater for Alpha than for x86.
>
> For Sparc, we might be able to do something with VIS instructions,
> although I don't know what the setup overhead is. Sun's memcpy and
> memset only use VIS when the size is greater than 512, because
> otherwise, it's not worth it.
>
> I don't know enough about PowerPC other than the proper use of the
> "eieio" instruction. :)
Remember that many architectures already have their own architecture
specific implementations. Also note that most of them have not been
modified to do the null-padding... The following architectures have
their own implementations: alpha, h8300, i386, m68k, m68knommu, mips,
ppc, ppc64, s390 and sh. The following use the generic implementation:
arm, arm26, cris, ia64, mips64, parisc, sparc, sparc64, um, v850 and
x86_64.
> Daniel Forrest wrote:
>
> >
> > - if (count) {
> > + if (count >= sizeof(long)) {
> > size_t count2;
> >
>
> I like this size check here, but the comparison should be to
> some number greater than sizeof(long). There is a threshold
> below which it's not worth it to do all of the extra loops.
> If you're going to fill only four bytes, it's probably best
> to do it just using the last loop.
>
> Maybe through some trial and error, we could determine what that
> threshold is. I'm betting it's something around 2* or 3* word size.
The problem is that this probably varies a lot between different
architectures, and also processor speeds. Probably best left for
the architecture specific implementations.
> Peter Kjellerstedt wrote:
> > For unaligned source or destination the "Multi copy & fill"
> > would degenerate into "Multi byte fill". However, for
> > architectures like ix86 and CRIS that can do unaligned long
> > access, it would be a win to remove the UNALIGNED() check,
> > and use long word copying all the time.
>
> In fact, it's possible to do the copy even if the source and
> dest are not aligned. It requires holding pieces of source in
> a register and doing shifts. If the CPU is much faster than
> the memory, this can be a huge win.
This is true. However, this too is probably best left for the
architecture specific implementations.
> > Then whether using memset() or your filling is a win depends
> > on the architecture and how many bytes needs to be filled.
> > For a slow processor with little function call overhead (like
> > CRIS), using memset seems to be a win almost immediately.
> > However, for a fast processor like my P4, the call to memset
> > is not a win until some 1500 bytes need to be filled.
>
> What is in memset that isn't in the fill code I suggested?
In the CRIS case it sets 12 registers to zero and uses the
movem instruction to copy them to memory at once (movem is
normally used to store/restore registers to/from the stack
on function entry/exit). Thus it can clear 48 bytes each
time through the loop. And of course the rest of the function
is hand crafted to give the best performance for any count.
//Peter
next reply other threads:[~2003-08-20 7:48 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2003-08-20 7:43 Peter Kjellerstedt [this message]
-- strict thread matches above, loose matches on Subject: below --
2003-08-16 21:10 generic strncpy - off-by-one error Peter Kjellerstedt
2003-08-18 18:41 ` Timothy Miller
2003-08-16 20:08 Peter Kjellerstedt
2003-08-16 9:19 Peter Kjellerstedt
2003-08-16 10:04 ` Daniel Forrest
2003-08-18 16:40 ` Timothy Miller
2003-08-16 8:15 Peter Kjellerstedt
2003-08-16 8:41 ` Daniel Forrest
2003-08-18 16:17 ` Timothy Miller
2003-08-18 16:06 ` Timothy Miller
2003-08-15 9:54 Peter Kjellerstedt
2003-08-15 17:52 ` Timothy Miller
2003-08-15 9:53 Peter Kjellerstedt
2003-08-15 17:47 ` Timothy Miller
2003-08-14 9:34 Peter Kjellerstedt
2003-08-14 19:45 ` Willy Tarreau
2003-08-14 20:24 ` Timothy Miller
2003-08-13 3:09 Anthony Truong
2003-08-13 2:18 Albert Cahalan
2003-08-13 2:47 ` Erik Andersen
2003-08-13 3:38 ` Albert Cahalan
2003-08-13 3:56 ` Nick Piggin
2003-08-13 5:18 ` Willy Tarreau
2003-08-13 19:03 ` Timothy Miller
2003-08-12 14:07 Yoshinori Sato
2003-08-12 14:39 ` Willy Tarreau
2003-08-12 14:50 ` Yoshinori Sato
2003-08-12 15:03 ` Valdis.Kletnieks
2003-08-12 15:54 ` William Gallafent
2003-08-12 16:19 ` Valdis.Kletnieks
2003-08-12 16:09 ` Andreas Schwab
2003-08-12 1:56 Anthony Truong
2003-08-12 17:14 ` Alan Cox
2003-08-12 21:53 ` Bernd Petrovitsch
2003-08-12 1:28 Anthony Truong
2003-08-12 16:24 ` Bernd Petrovitsch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=D069C7355C6E314B85CF36761C40F9A42E20D1@mailse02.se.axis.com \
--to=peter.kjellerstedt@axis.com \
--cc=forrest@lmcg.wisc.edu \
--cc=linux-kernel@vger.kernel.org \
--cc=miller@techsource.com \
--cc=willy@w.ods.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).