linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Kjellerstedt <peter.kjellerstedt@axis.com>
To: "'Timothy Miller'" <miller@techsource.com>,
	Daniel Forrest <forrest@lmcg.wisc.edu>
Cc: "'Willy Tarreau'" <willy@w.ods.org>,
	linux-kernel mailing list <linux-kernel@vger.kernel.org>
Subject: RE: generic strncpy - off-by-one error
Date: Wed, 20 Aug 2003 09:43:10 +0200	[thread overview]
Message-ID: <D069C7355C6E314B85CF36761C40F9A42E20D1@mailse02.se.axis.com> (raw)

[ I combined all my answers to your four mails into this one. ]

> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com] 
> Sent: Monday, August 18, 2003 18:07
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> Peter Kjellerstedt wrote:
> > For loops       2.867568    5.620561    8.128734   28.286289  
> > Multi byte fill 2.868031    5.670782    6.312027   11.336015  
> > 
> > And here are the numbers for my P4:
> > 
> > For loops       3.060262    5.927378    8.796814   30.659818  
> > Multi byte fill 3.126607    5.898459    7.096685   13.135379  
> > 
> > So there is no doubt that the multi byte version is a clear
> > winner (which was expected, I suppose).
> 
> Cool!  Hey, is this just an exercise, or are we actually going to
> use this?  I would be very happy to have something I contributed 
> to put into the kernel.  :)

I have no idea. But it sure seems like the generic strncpy() could
use some improvement (and probably strcpy() too). However, I guess
one can argue that it is better to keep the generic implementations
as simple as possible, and then have each architecture implement
their own optimized versions.

> > Here is the code that I used:
> > 
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > 	char *tmp = dest;
> > 
> > 	while (count && *src) {
> > 		*tmp++ = *src++;
> > 		count--;
> > 	}
> > 
> > 	if (count) {
> 
> Good idea... bad to do so many checks if count is zero.  On the
> other hand, if count is rarely zero, then it's a loss.  Maybe 
> benchmark with and without?

Actually, I think this change would be lost in the noise in
the measurements.

> > 		size_t count2;
> > 
> > 		while (count & (sizeof(long) - 1)) {
> > 			*tmp++ = '\0';
> > 			count--;
> > 		}
> > 
> > 		count2 = count / sizeof(long);
> 
> I know that a good compiler should migrate code to help the CPU 
> pipeline, but how about moving this "count2 = " line up to before 
> the first fill loop.  See if that helps any.  Always good to 
> precompute well in advance.

Cannot do that as the first loop modifies count. 

> > 		while (count2) {
> > 			*((long *)tmp)++ = '\0';
> > 			count2--;
> > 		}
> > 
> > 		count &= (sizeof(long) - 1);
> 
> And move this to before the middle fill loop.

In my later implementations this is not possible any longer.

> > 		while (count) {
> > 			*tmp++ = '\0';
> > 			count--;
> > 		}
> > 	}
> > 
> > 	return dest;
> > }


> Daniel Forrest wrote:
> > On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
> 
> > Shouldn't this be:
> > 
> > 		while (tmp & (sizeof(long) - 1)) {
> > 
> > 
> >>			*tmp++ = '\0';
> >>			count--;
> >>		}
> 
> Oh, yeah!  That's right.  We need to check the address.  Also need to 
> cast tmp to (int) or something (doesn't matter what it's cast to, 
> because we only care about the lower 2 or 3 bits).
> 
> Peter, please see if this makes any speed difference.  But it 
> definately needs this fix.

Yes, I added it in my later versions.

> Frankly, I'm surprised it works.  In fact, it might not, but 
> it's hard to tell from the tests just benchmarks.

I actually added verification to my benchmarking program too,
so the later versions I mailed are verified to work the same
as the standard glibc implementation at least.

> Also, if you're doing dense addressing on Alpha, and you do byte 
> accesses the addresses for char are byte addresses, but the code
> does read-modify-write to memory for byte accesses, because in 
> that mode, you can only do 32-bit and 64-bit accesses.  The 
> performance increase could be even greater for Alpha than for x86.
> 
> For Sparc, we might be able to do something with VIS instructions, 
> although I don't know what the setup overhead is.  Sun's memcpy and 
> memset only use VIS when the size is greater than 512, because 
> otherwise, it's not worth it.
> 
> I don't know enough about PowerPC other than the proper use of the 
> "eieio" instruction. :)

Remember that many architectures already have their own architecture
specific implementations. Also note that most of them have not been
modified to do the null-padding... The following architectures have
their own implementations: alpha, h8300, i386, m68k, m68knommu, mips,
ppc, ppc64, s390 and sh. The following use the generic implementation:
arm, arm26, cris, ia64, mips64, parisc, sparc, sparc64, um, v850 and
x86_64.


> Daniel Forrest wrote:
> 
> >  
> > -	if (count) {
> > +	if (count >= sizeof(long)) {
> >  		size_t count2;
> >  
> 
> I like this size check here, but the comparison should be to 
> some number greater than sizeof(long).  There is a threshold 
> below which it's not worth it to do all of the extra loops.  
> If you're going to fill only four bytes, it's probably best 
> to do it just using the last loop.
> 
> Maybe through some trial and error, we could determine what that 
> threshold is.  I'm betting it's something around 2* or 3* word size.

The problem is that this probably varies a lot between different 
architectures, and also processor speeds. Probably best left for
the architecture specific implementations.


> Peter Kjellerstedt wrote:
> > For unaligned source or destination the "Multi copy & fill" 
> > would degenerate into "Multi byte fill". However, for
> > architectures like ix86 and CRIS that can do unaligned long
> > access, it would be a win to remove the UNALIGNED() check,
> > and use long word copying all the time.
> 
> In fact, it's possible to do the copy even if the source and 
> dest are not aligned.  It requires holding pieces of source in 
> a register and doing shifts.  If the CPU is much faster than 
> the memory, this can be a huge win.

This is true. However, this too is probably best left for the
architecture specific implementations.

> > Then whether using memset() or your filling is a win depends
> > on the architecture and how many bytes needs to be filled.
> > For a slow processor with little function call overhead (like
> > CRIS), using memset seems to be a win almost immediately.
> > However, for a fast processor like my P4, the call to memset
> > is not a win until some 1500 bytes need to be filled.
> 
> What is in memset that isn't in the fill code I suggested?

In the CRIS case it sets 12 registers to zero and uses the 
movem instruction to copy them to memory at once (movem is 
normally used to store/restore registers to/from the stack 
on function entry/exit). Thus it can clear 48 bytes each 
time through the loop. And of course the rest of the function
is hand crafted to give the best performance for any count.

//Peter

             reply	other threads:[~2003-08-20  7:48 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-08-20  7:43 Peter Kjellerstedt [this message]
  -- strict thread matches above, loose matches on Subject: below --
2003-08-16 21:10 generic strncpy - off-by-one error Peter Kjellerstedt
2003-08-18 18:41 ` Timothy Miller
2003-08-16 20:08 Peter Kjellerstedt
2003-08-16  9:19 Peter Kjellerstedt
2003-08-16 10:04 ` Daniel Forrest
2003-08-18 16:40   ` Timothy Miller
2003-08-16  8:15 Peter Kjellerstedt
2003-08-16  8:41 ` Daniel Forrest
2003-08-18 16:17   ` Timothy Miller
2003-08-18 16:06 ` Timothy Miller
2003-08-15  9:54 Peter Kjellerstedt
2003-08-15 17:52 ` Timothy Miller
2003-08-15  9:53 Peter Kjellerstedt
2003-08-15 17:47 ` Timothy Miller
2003-08-14  9:34 Peter Kjellerstedt
2003-08-14 19:45 ` Willy Tarreau
2003-08-14 20:24 ` Timothy Miller
2003-08-13  3:09 Anthony Truong
2003-08-13  2:18 Albert Cahalan
2003-08-13  2:47 ` Erik Andersen
2003-08-13  3:38   ` Albert Cahalan
2003-08-13  3:56     ` Nick Piggin
2003-08-13  5:18     ` Willy Tarreau
2003-08-13 19:03   ` Timothy Miller
2003-08-12 14:07 Yoshinori Sato
2003-08-12 14:39 ` Willy Tarreau
2003-08-12 14:50 ` Yoshinori Sato
2003-08-12 15:03   ` Valdis.Kletnieks
2003-08-12 15:54     ` William Gallafent
2003-08-12 16:19       ` Valdis.Kletnieks
2003-08-12 16:09     ` Andreas Schwab
2003-08-12  1:56 Anthony Truong
2003-08-12 17:14 ` Alan Cox
2003-08-12 21:53   ` Bernd Petrovitsch
2003-08-12  1:28 Anthony Truong
2003-08-12 16:24 ` Bernd Petrovitsch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=D069C7355C6E314B85CF36761C40F9A42E20D1@mailse02.se.axis.com \
    --to=peter.kjellerstedt@axis.com \
    --cc=forrest@lmcg.wisc.edu \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miller@techsource.com \
    --cc=willy@w.ods.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).