linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: generic strncpy - off-by-one error
@ 2003-08-20  7:43 Peter Kjellerstedt
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-20  7:43 UTC (permalink / raw)
  To: 'Timothy Miller', Daniel Forrest
  Cc: 'Willy Tarreau', linux-kernel mailing list

[ I combined all my answers to your four mails into this one. ]

> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com] 
> Sent: Monday, August 18, 2003 18:07
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> Peter Kjellerstedt wrote:
> > For loops       2.867568    5.620561    8.128734   28.286289  
> > Multi byte fill 2.868031    5.670782    6.312027   11.336015  
> > 
> > And here are the numbers for my P4:
> > 
> > For loops       3.060262    5.927378    8.796814   30.659818  
> > Multi byte fill 3.126607    5.898459    7.096685   13.135379  
> > 
> > So there is no doubt that the multi byte version is a clear
> > winner (which was expected, I suppose).
> 
> Cool!  Hey, is this just an exercise, or are we actually going to
> use this?  I would be very happy to have something I contributed 
> to put into the kernel.  :)

I have no idea. But it sure seems like the generic strncpy() could
use some improvement (and probably strcpy() too). However, I guess
one can argue that it is better to keep the generic implementations
as simple as possible, and then have each architecture implement
their own optimized versions.

> > Here is the code that I used:
> > 
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > 	char *tmp = dest;
> > 
> > 	while (count && *src) {
> > 		*tmp++ = *src++;
> > 		count--;
> > 	}
> > 
> > 	if (count) {
> 
> Good idea... bad to do so many checks if count is zero.  On the
> other hand, if count is rarely zero, then it's a loss.  Maybe 
> benchmark with and without?

Actually, I think this change would be lost in the noise in
the measurements.

> > 		size_t count2;
> > 
> > 		while (count & (sizeof(long) - 1)) {
> > 			*tmp++ = '\0';
> > 			count--;
> > 		}
> > 
> > 		count2 = count / sizeof(long);
> 
> I know that a good compiler should migrate code to help the CPU 
> pipeline, but how about moving this "count2 = " line up to before 
> the first fill loop.  See if that helps any.  Always good to 
> precompute well in advance.

Cannot do that as the first loop modifies count. 

> > 		while (count2) {
> > 			*((long *)tmp)++ = '\0';
> > 			count2--;
> > 		}
> > 
> > 		count &= (sizeof(long) - 1);
> 
> And move this to before the middle fill loop.

In my later implementations this is not possible any longer.

> > 		while (count) {
> > 			*tmp++ = '\0';
> > 			count--;
> > 		}
> > 	}
> > 
> > 	return dest;
> > }


> Daniel Forrest wrote:
> > On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
> 
> > Shouldn't this be:
> > 
> > 		while (tmp & (sizeof(long) - 1)) {
> > 
> > 
> >>			*tmp++ = '\0';
> >>			count--;
> >>		}
> 
> Oh, yeah!  That's right.  We need to check the address.  Also need to 
> cast tmp to (int) or something (doesn't matter what it's cast to, 
> because we only care about the lower 2 or 3 bits).
> 
> Peter, please see if this makes any speed difference.  But it 
> definately needs this fix.

Yes, I added it in my later versions.

> Frankly, I'm surprised it works.  In fact, it might not, but 
> it's hard to tell from the tests just benchmarks.

I actually added verification to my benchmarking program too,
so the later versions I mailed are verified to work the same
as the standard glibc implementation at least.

> Also, if you're doing dense addressing on Alpha, and you do byte 
> accesses the addresses for char are byte addresses, but the code
> does read-modify-write to memory for byte accesses, because in 
> that mode, you can only do 32-bit and 64-bit accesses.  The 
> performance increase could be even greater for Alpha than for x86.
> 
> For Sparc, we might be able to do something with VIS instructions, 
> although I don't know what the setup overhead is.  Sun's memcpy and 
> memset only use VIS when the size is greater than 512, because 
> otherwise, it's not worth it.
> 
> I don't know enough about PowerPC other than the proper use of the 
> "eieio" instruction. :)

Remember that many architectures already have their own architecture
specific implementations. Also note that most of them have not been
modified to do the null-padding... The following architectures have
their own implementations: alpha, h8300, i386, m68k, m68knommu, mips,
ppc, ppc64, s390 and sh. The following use the generic implementation:
arm, arm26, cris, ia64, mips64, parisc, sparc, sparc64, um, v850 and
x86_64.


> Daniel Forrest wrote:
> 
> >  
> > -	if (count) {
> > +	if (count >= sizeof(long)) {
> >  		size_t count2;
> >  
> 
> I like this size check here, but the comparison should be to 
> some number greater than sizeof(long).  There is a threshold 
> below which it's not worth it to do all of the extra loops.  
> If you're going to fill only four bytes, it's probably best 
> to do it just using the last loop.
> 
> Maybe through some trial and error, we could determine what that 
> threshold is.  I'm betting it's something around 2* or 3* word size.

The problem is that this probably varies a lot between different 
architectures, and also processor speeds. Probably best left for
the architecture specific implementations.


> Peter Kjellerstedt wrote:
> > For unaligned source or destination the "Multi copy & fill" 
> > would degenerate into "Multi byte fill". However, for
> > architectures like ix86 and CRIS that can do unaligned long
> > access, it would be a win to remove the UNALIGNED() check,
> > and use long word copying all the time.
> 
> In fact, it's possible to do the copy even if the source and 
> dest are not aligned.  It requires holding pieces of source in 
> a register and doing shifts.  If the CPU is much faster than 
> the memory, this can be a huge win.

This is true. However, this too is probably best left for the
architecture specific implementations.

> > Then whether using memset() or your filling is a win depends
> > on the architecture and how many bytes needs to be filled.
> > For a slow processor with little function call overhead (like
> > CRIS), using memset seems to be a win almost immediately.
> > However, for a fast processor like my P4, the call to memset
> > is not a win until some 1500 bytes need to be filled.
> 
> What is in memset that isn't in the fill code I suggested?

In the CRIS case it sets 12 registers to zero and uses the 
movem instruction to copy them to memory at once (movem is 
normally used to store/restore registers to/from the stack 
on function entry/exit). Thus it can clear 48 bytes each 
time through the loop. And of course the rest of the function
is hand crafted to give the best performance for any count.

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-16 21:10 Peter Kjellerstedt
  2003-08-18 18:41 ` Timothy Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-16 21:10 UTC (permalink / raw)
  To: 'Daniel Forrest'
  Cc: 'Timothy Miller', 'Willy Tarreau',
	linux-kernel mailing list

> -----Original Message-----
> From: Daniel Forrest [mailto:forrest@lmcg.wisc.edu] 
> Sent: Saturday, August 16, 2003 12:04
> To: Peter Kjellerstedt
> Cc: 'Daniel Forrest'; 'Timothy Miller'; 'Willy Tarreau'; 
> linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> On Sat, Aug 16, 2003 at 11:19:30AM +0200, Peter Kjellerstedt wrote:
> > 
> > Actually, it should be:
> > 
> > 	while (count && ((long)tmp & (sizeof(long) - 1)))
> > 
> 
> Oops, you're right, I forgot that the count could be small.
> 
> But, now that I think of it, maybe this would be best...
> 
>  char *strncpy(char *dest, const char *src, size_t count)
>  {
>  	char *tmp = dest;
>  
>  	while (count && *src) {
>  		*tmp++ = *src++;
>  		count--;
>  	}
>  
> -	if (count) {
> +	if (count >= sizeof(long)) {
>  		size_t count2;
>  
> -		while (count && ((long)tmp & (sizeof(long) - 1))) {
> +		while ((long)tmp & (sizeof(long) - 1)) {
>  			*tmp++ = '\0';
>  			count--;
>  		}
>  
>  		count2 = count / sizeof(long);
>  		while (count2) {
>  			*((long *)tmp)++ = '\0';
>  			count2--;
>  		}
>  
>  		count &= (sizeof(long) - 1);
> -		while (count) {
> -			*tmp++ = '\0';
> -			count--;
> -		}
> +	}
> +
> +	while (count) {
> +		*tmp++ = '\0';
> +		count--;
>  	}
>  
>  	return dest;
>  }
> 
> -- 
> Dan

Looks good to me. However, this only optimizes the filling
part. So why not try to optimize the copying part too? Here
is a version that optimizes the (hopefully somewhat common)
case of both source and destination being long aligned
(UNALIGNED() and DETECT_NUL() were borrowed from newlib's
strncpy()):

#include <limits.h>

/* Non-zero if either x or y is not aligned on a "long" boundary. */
#define UNALIGNED(x, y) \
	(((long)x & (sizeof(long) - 1)) | ((long)y & (sizeof(long) - 1)))

/* DETECT_NUL() is non-zero if x contains a NUL byte. */
#if LONG_MAX == 2147483647L
#define DETECT_NUL(x) (((x) - 0x01010101) & ~(x) & 0x80808080)
#else
#if LONG_MAX == 9223372036854775807L
#define DETECT_NUL(x) (((x) - 0x0101010101010101) & ~(x) & 0x8080808080808080)
#else
#error long int is not a 32bit or 64bit type.
#endif
#endif

char *strncpy(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	if (!UNALIGNED(src, tmp)) {
		while (count >= sizeof(long)) {
			if (DETECT_NUL(*((long*)src)))
				break;

			*((long *)tmp)++ = *((long *)src)++;
			count -= sizeof(long);
		}
	}

	while (count) {
		count--;
		if (!(*tmp++ = *src++))
			break;
	}

	if (count) {
		size_t count2;

		if (count >= sizeof(long)) {
			while ((long)tmp & (sizeof(long) - 1)) {
				*tmp++ = '\0';
				count--;
			}

			count2 = count / sizeof(long);
			while (count2) {
				*((long *)tmp)++ = '\0';
				count2--;
			}
		}

		count &= (sizeof(long) - 1);
		while (count) {
			*tmp++ = '\0';
			count--;
		}
	}

	return dest;
}

And here are some benchmarking numbers for the CRIS
architecture. "Multi copy & fill" is the code above.
"Multi copy+memset" is the same as above, but with the
filling replaced by a call to memset().

Multi copy+memset 1.374258    2.627357    3.125354    4.614480  
Multi copy & fill 1.339129    2.607850    3.260540    8.303503  
Multi byte fill   2.336643    4.619374    5.290972   10.326644  
For loops         2.828718    5.591865    8.110286   40.684943  
strncpy (glibc)   2.201317    4.273052    7.281693   31.474109  
strncpy (uClibc)  2.311860    4.588357    9.112990   45.387527  

And the numbers for my P4:

Multi copy+memset 1.284029    2.111502    4.672941    8.470845  
Multi copy & fill 1.203095    2.160514    3.265682    8.645152  
Multi byte fill   3.161438    6.249649    7.144416   12.687760  
For loops         3.202569    6.028025    9.410537   33.492053  
strncpy (glibc)   2.359909    3.943612    6.952230   29.604865  
strncpy (uClibc)  3.234713    5.943762   10.137144   39.651053  

For unaligned source or destination the "Multi copy & fill" 
would degenerate into "Multi byte fill". However, for
architectures like ix86 and CRIS that can do unaligned long
access, it would be a win to remove the UNALIGNED() check,
and use long word copying all the time.

Then whether using memset() or your filling is a win depends
on the architecture and how many bytes needs to be filled.
For a slow processor with little function call overhead (like
CRIS), using memset seems to be a win almost immediately.
However, for a fast processor like my P4, the call to memset
is not a win until some 1500 bytes need to be filled.

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-16 20:08 Peter Kjellerstedt
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-16 20:08 UTC (permalink / raw)
  To: 'Timothy Miller', Peter Kjellerstedt; +Cc: linux-kernel mailing list

> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com] 
> Sent: Friday, August 15, 2003 19:47
> To: Peter Kjellerstedt
> Cc: linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> Peter Kjellerstedt wrote:
> 
> >>While I can understand that certain architectures may benefit 
> >>from that alteration, I am curious as to what SPECIFICALLY it 
> >>is doing that is different.  How do they differ?
> > 
> > I do not know if this will give you anything, but here is
> > the disassembled CRIS version of your first while loop 
> > (note that the branch instructions have one delay slot,
> > and that dest, src and count are in $r10, $r11 and $r12):
> > 
> >   while (count && *src)
> >    805d8:       6cc6                    test.d $r12
> >    805da:       1830                    beq 805f4
> >    805dc:       6a96                    move.d $r10,$r9
> 
> Ok, test count and exit if count is zero.
> 
> I take it that this move is the copy of the dst to a 
> temporary variable?

Yes.

> >    805de:       8b0b                    test.b [$r11]
> >    805e0:       1230                    beq 805f4
> >    805e2:       0f05                    nop 
> 
> Test *s and exit if zero.
> 
> >   {
> >     count--;
> >    805e4:       81c2                    subq 1,$r12
> 
> 
> >     *tmp++ = *src++;
> >    805e6:       4bde                    move.b [$r11+],$r13
> 
> Fetch *s and increment.  This is redundant.  Why didn't it 
> know to fetch *s above and keep it in a register?

I think this is the way gcc produces while loops (not exactly
my area of expertise).

> >    805e8:       6cc6                    test.d $r12
> >    805ea:       0830                    beq 805f4
> >    805ec:       c9df                    move.b $r13,[$r9+]
> 
> Test count and exit if count is zero.  Why is it doing this 
> again?  Why not jump back to the top?  Oh, wait... in that 
> case, for the loop to exit, it would jump to the top and then 
> jump back to past the end, rather than just failing one branch.

Yes.

> I see that the delay slot was filled with the store to *dst.  But is 
> there also some pipeline latency that makes this worth while?

As far as I know this should be a win.

> >    805ee:       8b0b                    test.b [$r11]
> >    805f0:       f320                    bne 805e4
> >    805f2:       0f05                    nop 
> 
> Test *src AGAIN and loop if nonzero.  Redundant.

No. Note that $r11 was increased after the move at 805e6,
so what it tests here is the next byte to be copied. Once
the loop is running this is the only test of *src. However,
the code isn't optimal as it causes two reads of *src, but
it was not trivial to modify the assembler code to get
optimal behavior for this version, so I can understand 
that the compiler did not make it.

> >   }
> > 
> > And here is my first while loop:
> > 
> >   while (count)
> >    8062c:       6cc6                    test.d $r12
> >    8062e:       1030                    beq 80640
> >    80630:       6a96                    move.d $r10,$r9
> 
> Test count and bypass if zero.  Also, copy dst.
> 
> >   {
> >     count--;
> >     if (!(*tmp++ = *src++))
> >    80632:       4bde                    move.b [$r11+],$r13
> >    80634:       c9db                    move.b $r13,[$r9]
> 
> Move data
> 
> >    80636:       890f                    test.b [$r9+]
> 
> Interesting how (a) it moved the increment to the test, and 
> (b) it makes a redundant read of the dest.
> 
> Does this arch not get condition codes from things other than 
> test and compare?

Only moves TO a register results in the status flags being
changed. But I do not know why the compiler doesn't produce
either 

	move.b [$r11+],$r13
	move.b $r13,[$r9+]
	test.b $r13

or even better

	move.b [$r11+],$r13
	move.b $r13,[$r9+]

for this code, as I believe they should be equivalent, but with
better cycle counts. I will ask our compiler guy next week...

> And why didn't it test $r13 instead?  Wouldn't that have been 
> a lot more efficient?

One rather than two cycles.
 
> >    80638:       0630                    beq 80640
> >    8063a:       81c2                    subq 1,$r12
> 
> Exit on zero and decrement count.
> 
> >       break;
> >    8063c:       f520                    bne 80632
> >    8063e:       0f05                    nop 
> 
> Loop.
> 
> >   }
> > 
> > 
> >>>Also note that your version
> >>>of the second loop needs an explicit  comparison with -1,
> >>>whereas mine uses an implicit comparison with 0.
> >>
> >>I don't understand why you say I need an explicit comparison 
> >>with -1. My first loop exits either with the number of bytes 
> >>remaining in the buffer or with zero if it's copied count 
> >>number of bytes.
> > 
> > 
> > I was talking about the second loop. The object code the compiler
> > produces for your version actually tests the count variable after
> > it decreases it, which is why it tests for -1.
> 
> Why does it do that?  Is that somehow more optimal?  Why 
> doesn't it test BEFORE it decrements?  Isn't that more 
> straightforward?

It would be hard to do in assembler, since after the test
instruction is executed one must do the branch instruction, 
and there is no way to do the decrementation in-between
without affecting the status flags.

We can easily change the code and move the decrementation into
it since we do not care what happens to count after the loop
(or can take the changed value into account), but the compiler
cannot (easily) do this as it would change the value count has
after the loop (-1 vs. 0).

> In any event, couldn't that be fixed by moving the decrement 
> into the loop?
> 
> > 
> >>The second loop WOULD require a comparison with -1 IF the 
> >>"count--" were not inside of the loop body.  As it IS in the 
> >>loop body, there is no need for that.  My second loop has an 
> >>implicit comparison against zero.
> > 
> > 
> > Hmm, your second loop from above looks like:
> > 
> > 	while (n--) {
> > 		*s++ = 0;
> > 	}
> > 
> > whereas mine looks like:
> > 
> >  	while (count) {
> >  		*tmp++ = '\0';
> >  		count--;
> >  	}
> > 
> > You seem to be referring to my version, where what you say is true.
> 
> And what I was saying was that since what the while should 
> test is the value before the decrement, I assumed that the 
> test would be before the decrement.

See my answer above.

> >>I agree that this is definately a more elegant look to the 
> >>code, and I would prefer what you have done here.  But what 
> >>puzzles me is that this is functionally and logically 
> >>equivalent to my code.
> >>
> >>So, this code:
> >>
> >>for (A; B; C) {}
> >>
> >>is the same as this:
> >>
> >>A;
> >>while (B) {
> >>	...
> >>	C;
> >>}
> >>
> >>
> >>So why is it that this mere syntactic difference causes the 
> >>compiler to produce a better result?
> > 
> > 
> > I wish I new. Actually in the CRIS case, it seems to be an
> > optimizer thing. If I change your first loop from
> > 
> > 	while (n && *s2) {
> > 		n--;
> > 		*s++ = *s2++;
> > 	}
> > 
> > to
> > 
> > 	while (n && *s2) {
> > 		*s++ = *s2++;
> > 		n--;
> > 	}
> > 
> > it gives the expected object code, i.e., the same as my
> > first for loop. So here is a modified version of your
> > code that gives exactly the same object code (for CRIS) 
> > as my version with the for loops:
> 
> A decent optimizer should know that the "n--" and the "*s++ = 
> *s2++" are independent and migrate them to optimal positions.

I agree. I think I will have to ask our compiler guy about
this too...

> Furthermore, I put the "n--" first because I assumed that the 
> compiler might be stupid about it.  The idea is to order the 
> statements so that the results of a given instruction are being 
> computed while the next one is being decoded.
> 
> My assumption was that the code would branch back to the loop test 
> (rather than make a second one) at the end of the loop and 
> then test n. 

Assumptions does not seem to work when dealing with the 
compiler. ;) My tests with benchmarking this code has clearly
taught me that...

>   Therefore, decrementing n earlier would be a win in terms of having 
> the result earlier for the test, reducing pipline stall.

I guess trying to optimize generic code is not very easy
as what is optimal on one architecture may be bad on some
other. However, it should be possible to get better results
than what the current implementation does.

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-16  9:19 Peter Kjellerstedt
  2003-08-16 10:04 ` Daniel Forrest
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-16  9:19 UTC (permalink / raw)
  To: 'Daniel Forrest'
  Cc: 'Timothy Miller', 'Willy Tarreau',
	linux-kernel mailing list

> -----Original Message-----
> From: Daniel Forrest [mailto:forrest@lmcg.wisc.edu] 
> Sent: Saturday, August 16, 2003 10:41
> To: Peter Kjellerstedt
> Cc: 'Timothy Miller'; 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
> > 
> > Here is the code that I used:
> > 
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > 	char *tmp = dest;
> > 
> > 	while (count && *src) {
> > 		*tmp++ = *src++;
> > 		count--;
> > 	}
> > 
> > 	if (count) {
> > 		size_t count2;
> > 
> > 		while (count & (sizeof(long) - 1)) {
> 
> Shouldn't this be:
> 
> 		while (tmp & (sizeof(long) - 1)) {

Actually, it should be:

	while (count && ((long)tmp & (sizeof(long) - 1)))

> > 			*tmp++ = '\0';
> > 			count--;
> > 		}
> > 
> > 		count2 = count / sizeof(long);
> > 		while (count2) {
> > 			*((long *)tmp)++ = '\0';
> > 			count2--;
> > 		}
> > 
> > 		count &= (sizeof(long) - 1);
> > 		while (count) {
> > 			*tmp++ = '\0';
> > 			count--;
> > 		}
> > 	}
> > 
> > 	return dest;
> > }
> 
> -- 
> Dan

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-16  8:15 Peter Kjellerstedt
  2003-08-16  8:41 ` Daniel Forrest
  2003-08-18 16:06 ` Timothy Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-16  8:15 UTC (permalink / raw)
  To: 'Timothy Miller'
  Cc: 'Willy Tarreau', linux-kernel mailing list

> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com] 
> Sent: Friday, August 15, 2003 19:52
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> Peter Kjellerstedt wrote:
> > Timothy's       3.125003    6.128571    9.147905   33.325337  
> 
> Which of mine did you test?  The one with the single-byte 
> fill, or the one with the multiple fill loops that does 
> words for the bulk of the fills?

Stupid me. I only tested the single byte fill version.

> With some minor tweaks to eliminate compiler stupidity which compares 
> against -1, that might win on the fill phase.  No?

Here are the numbers for my for loop version and your multi
byte fill version for CRIS:

For loops       2.867568    5.620561    8.128734   28.286289  
Multi byte fill 2.868031    5.670782    6.312027   11.336015  

And here are the numbers for my P4:

For loops       3.060262    5.927378    8.796814   30.659818  
Multi byte fill 3.126607    5.898459    7.096685   13.135379  

So there is no doubt that the multi byte version is a clear
winner (which was expected, I suppose).

Here is the code that I used:

char *strncpy(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	while (count && *src) {
		*tmp++ = *src++;
		count--;
	}

	if (count) {
		size_t count2;

		while (count & (sizeof(long) - 1)) {
			*tmp++ = '\0';
			count--;
		}

		count2 = count / sizeof(long);
		while (count2) {
			*((long *)tmp)++ = '\0';
			count2--;
		}

		count &= (sizeof(long) - 1);
		while (count) {
			*tmp++ = '\0';
			count--;
		}
	}

	return dest;
}

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-15  9:54 Peter Kjellerstedt
  2003-08-15 17:52 ` Timothy Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-15  9:54 UTC (permalink / raw)
  To: 'Willy Tarreau'
  Cc: 'Timothy Miller', linux-kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 6327 bytes --]

> -----Original Message-----
> From: Willy Tarreau [mailto:willy@w.ods.org] 
> Sent: Thursday, August 14, 2003 21:45
> To: Peter Kjellerstedt
> Cc: 'Timothy Miller'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> On Thu, Aug 14, 2003 at 11:34:50AM +0200, Peter Kjellerstedt wrote:
>  
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > 	char *tmp = dest;
> > 
> > 	while (count) {
> > 		if (*src == '\0')
> > 			break;
> > 
> > 		*tmp++ = *src++;
> > 		count--;
> > 	}
> > 
> > 	while (count) {
> > 		*tmp++ = '\0';
> > 		count--;
> > 	}
> > 
> > 	return dest;
> > }
> > 
> > Moving the check for *src == '\0' into the first loop seems
> > to let the compiler reuse the object code a little better
> > (may depend on the optimizer). Also note that your version
> > of the second loop needs an explicit  comparison with -1,
> > whereas mine uses an implicit comparison with 0.
> 
> Well, if you're at this level of comparison, then I may object that
> 'count' is evaluated twice when jumping from one loop to the 
> other one.
> 
> *Algorithmically* speaking, the most optimal code should be :
> 
> char *strncpy(char *dest, const char *src, size_t count)
> {
>         char *tmp = dest;
>         if (unlikely(!count))
>                 goto end;
> loop1:
>         if ((*tmp = *src++) == '\0')
> 		goto next;
>         tmp++;
>         if (likely(--count))
> 		goto loop1;
> 	else
> 		goto end;
> loop2:
>         *tmp = '\0';
> next:
>         tmp++;
>         if (likely(--count))
> 		goto loop2;
> end:
>         return dest;
> }

I hope we do not opt to go with this version. :)

> (sorry for those who hate gotos). Look at it : no test is 
> performed twice, no byte is read nor written twice in any 
> case. The assembly code produced is optimal on x86 (well, 
> in fact we could delete exactly one conditionnal jump because 
> the compiler doesn't know how to reuse them for several tests).
> 16 inlinable instructions (= excluding function in/out) which 
> can be executed 2 at a time if your CPU has 2 pipelines. about
> 3 cycles/copied byte, 2 cycles/filled byte.

It does not give optimal code for CRIS for the second loop.
It can easily be fixed by combining loop2 and next (which
will cause tmp to be assigned twice when changing loop),
but I know this was not the intent of this exercise.

> Unfortunately, it fools branch predictors and prefetch 
> mechanisms found in today's processors, so it results in 
> slower code than yours (at least on athlon and P3). Perhaps 
> it would be fast on older processors, I don't know.
> 
> All that demonstrates that whatever your efforts in helping 
> the optimizer, the only meaningful result is the benchmark. 
> Number of bytes and speculations on the reusability of 
> information between lines of codes are not what makes our
> programs fast anymore :-(

It was easier some years ago, wasn't it? ;)

> > Testing on the CRIS architecture, your version is 24 instructions,
> > whereas mine is 18. For comparison, Eric's one is 12 and the
> > currently used implementation is 26 (when corrected for the
> > off-by-one error by comparing with > 1 rather than != 0 in the
> > second loop).
> 
> Just out of curiosity, can the CRIS architecture execute 
> multiple instructions per cycle ? have you tried to determine 

No it cannot. :( Our ETRAX 100LX processor which uses the
CRIS architecture is a 100 MIPS processor without most of
today's fancy processor optimizations like multiple instructions,
branch prediction etc. It is designed for embedded products.

> the dependencies between the instructions ? Did you time the 
> different functions ? (yours is rather fast on x86 BTW).

No I didn't, but I have done it now.

I made a program that called the respective strncpy() 
implementation a number of times copying a 100 byte string
into a 1000 byte buffer. For each function I tested with
copying 50, 100, 200 and 1000 bytes. I have attached the 
source to this mail if anyone wants to make sure I did not
screw up the benchmarking too much... ;)

This table shows the times in seconds to call each function
500,000 times for CRIS:

                   50         100         200        1000
Original        3.123906    6.138616    9.146204   33.323051  
Eric's          2.602632    5.105367    9.637701   45.910798  
Timothy's       3.125003    6.128571    9.147905   33.325337  
My first        2.868396    5.626149    8.138134   28.276760  
My second (for) 2.622412    5.129988    7.636364   27.786745  
Algorithmic     2.597808    5.119332    7.627300   27.785999  

Going by these numbers, Eric's version is a good candidate 
for the copying phase, but loses out badly when it comes to
the filling. The best versions here seem to be my version
with the for loops and your algorithmic version which both
give almost identical numbers (but look below for more on
the algorithmic version).

This table shows the times in seconds to call each function
20,000,000 times for my P4 2.53GHz:

                   50         100         200        1000
Original        2.900444    5.475311    9.095701   37.462796  
Eric's          2.988404    5.637352    9.576857   37.580639  
Timothy's       2.929508    5.534824    9.157147   36.836899  
My first        2.951068    5.511169    8.321381   29.669017  
My second (for) 2.921675    5.531486    8.296728   28.940855  
Algorithmic     3.911937    7.343235   12.699275   54.288284  

The interesting thing here is that the original code wins
the copying phase. But the numbers for the copying phase for
the first five versions are so similar, that I doubt one can 
say one is better than the other. However, when it comes to
the filling phase my versions are back in the lead. These 
numbers also shows very clearly what you stated above about
branch predictions, as the algorithmic version loses out
badly here.

All these numbers taken together, I would say that my version
with the for loops seems to be the overall winner. :)

> To conclude, I would say that if we want to implement really 
> fast low-level primitives such as str*, mem*, ... there's 
> nothing better than assembly associated with benchmarks on 
> different CPU architectures and models.
> 
> But I don't know if strncpy() is used enough to justify that...

I agree

> Regards,
> Willy

//Peter


[-- Attachment #2: testa.c --]
[-- Type: application/octet-stream, Size: 5293 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>

#include <sys/time.h>

char *strncpy0(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count-- && (*tmp++ = *src++) != '\0')
    /* nothing */;

  return dest;
}

char *strncpy0a(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count && (*dest++ = *src++) != '\0')
    count--;

  while (count > 1)
  {
    *dest++ = 0;
    count--;
  }

  return tmp;
}

char *strncpy0b(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count && (*tmp++ = *src++) != '\0')
    count--;

  while (count > 1)
  {
    *tmp++ = 0;
    count--;
  }

  return dest;
}

char *strncpy0x(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count)
  {
    if (!(*tmp = *src++))
      break;
    tmp++;

    count--;
  }

  return dest;
}

char *strncpy1(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count)
  {
    if ((*tmp = *src))
      src++;
    tmp++;

    count--;
  }

  return dest;
}

char *strncpy1x(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  for (; count; count--)
  {
    if ((*tmp = *src))
      src++;
    tmp++;
  }

  return dest;
}

char *strncpy2(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count--)
  {
    if ((*tmp++ = *src))
      src++;
  }

  return dest;
}

char *strncpy3(char *dest, const char *src, size_t count)
{
   if (count)
   {
      char *tmp = dest;

      while (1)
      {
         *tmp = *src;
         if (*src)
           src++;
         tmp++;
         if (!count--)
           break;
      }
   }

   return dest;
}

char *strncpy4(char *dest, const char *src, size_t count)
{
   if (count)
   {
      char *tmp = dest;

      do
      {
         if ((*tmp = *src))
           ++src;
         ++tmp;
      } while (--count);
   }

   return dest;
}

char *strncpy5(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count && *src)
  {
    count--;
    *tmp++ = *src++;
  }

  while (count--)
  {
    *tmp++ = 0;
  }

  return dest;
}

char *strncpy5w(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count)
  {
    if (*src == '\0')
      break;

    *tmp++ = *src++;
    count--;
  }

  while (count)
  {
    *tmp++ = 0;
    count--;
  }

  return dest;
}

char *strncpy5x(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  while (count)
  {
    count--;
    if (!(*tmp++ = *src++))
      break;
  }

  while (count)
  {
    count--;
    *tmp++ = 0;
  }

  return dest;
}

char *strncpy5y(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  for (; count && *src; count--)
    *tmp++ = *src++;

  for (; count; count--)
    *tmp++ = '\0';

  return dest;
}

#define likely(x)    __builtin_expect(!!(x),1)
#define unlikely(x)  __builtin_expect(!!(x),0)

char *strncpy6(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  if (unlikely(!count))
    goto end;

loop1:
  if ((*tmp = *src++) == '\0')
    goto next;
  tmp++;
  if (likely(--count))
    goto loop1;
  else
    goto end;

loop2:
  *tmp = '\0';
next:
  tmp++;
  if (likely(--count))
    goto loop2;

end:
  return dest;
}

char *strncpy6x(char *dest, const char *src, size_t count)
{
  char *tmp = dest;

  if (unlikely(!count))
    goto end;

loop1:
  if ((*tmp = *src++) == '\0')
    goto loop2;
  tmp++;
  if (likely(--count))
    goto loop1;
  else
    goto end;

loop2:
  *tmp++ = '\0';
  if (likely(--count))
    goto loop2;

end:
  return dest;
}

void time_it(char *(*func)(char *dest, const char *src, size_t count),
             size_t size)
{
  struct timeval start;
  struct timeval end;
  long sec, usec;
  static char *string = "asdl hfaslkfh alksjfh laksjh lkajsdflkasflksdf ljhfljshf laksflaksfhlaksjhflaksfh lakshf lkds hflsd";
  static char buffer[1000];
#ifdef __CRIS__
  int count = 500000;
#else
  int count = 20000000;
#endif

  if (size > sizeof buffer)
  {
    fprintf(stderr, "Buffer too small!\n");
    exit(1);
  }

  gettimeofday(&start, NULL);

  while (count)
  {
    func(buffer, string, size);
    count--;
  }

  gettimeofday(&end, NULL);

  sec = end.tv_sec - start.tv_sec;
  if ((usec = end.tv_usec - start.tv_usec) < 0)
  {
    sec--;
    usec += 1000000;
  }

  printf("%3ld.%06ld  ", sec, usec);
  fflush(stdout);
}

void do_it(const char *name,
           char *(*func)(char *dest, const char *src, size_t count))
{
  printf("%-12s  ", name);
  fflush(stdout);
  time_it(func, 50);
  time_it(func, 100);
  time_it(func, 200);
  time_it(func, 1000);
  printf("\n");
}

#define TIME_IT(func) time_it(#func, func)

int main(int argc, char* argv[])
{
  do_it("Original", strncpy0a);
  do_it("Eric's", strncpy1);
  do_it("Timothy's", strncpy5);
  do_it("My first", strncpy5x);
  do_it("For loops", strncpy5y);
  do_it("Algorithmic", strncpy6);

  exit(0);
}

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-15  9:53 Peter Kjellerstedt
  2003-08-15 17:47 ` Timothy Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-15  9:53 UTC (permalink / raw)
  To: 'Timothy Miller'; +Cc: linux-kernel mailing list

> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com] 
> Sent: Thursday, August 14, 2003 22:24
> To: Peter Kjellerstedt
> Cc: linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
> 
> Peter Kjellerstedt wrote:
> 
> >>
> >>Nice!
> > 
> > 
> > I can but agree.
> > 
> > 
> >>How about this:
> >>
> >>
> >>char *strncpy(char * s1, const char * s2, size_t n)
> >>{
> >>	register char *s = s1;
> >>
> >>	while (n && *s2) {
> >>		n--;
> >>		*s++ = *s2++;
> >>	}
> >>	while (n--) {
> >>		*s++ = 0;
> >>	}
> >>	return s1;
> >>}
> > 
> > 
> > This may be improved further:
> > 
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > 	char *tmp = dest;
> > 
> > 	while (count) {
> > 		if (*src == '\0')
> > 			break;
> > 
> > 		*tmp++ = *src++;
> > 		count--;
> > 	}
> > 
> > 	while (count) {
> > 		*tmp++ = '\0';
> > 		count--;
> > 	}
> > 
> > 	return dest;
> > }
> > 
> > Moving the check for *src == '\0' into the first loop seems
> > to let the compiler reuse the object code a little better
> > (may depend on the optimizer). 
> 
> While I can understand that certain architectures may benefit 
> from that alteration, I am curious as to what SPECIFICALLY it 
> is doing that is different.  How do they differ?

I do not know if this will give you anything, but here is
the disassembled CRIS version of your first while loop 
(note that the branch instructions have one delay slot,
and that dest, src and count are in $r10, $r11 and $r12):

  while (count && *src)
   805d8:       6cc6                    test.d $r12
   805da:       1830                    beq 805f4
   805dc:       6a96                    move.d $r10,$r9
   805de:       8b0b                    test.b [$r11]
   805e0:       1230                    beq 805f4
   805e2:       0f05                    nop 
  {
    count--;
   805e4:       81c2                    subq 1,$r12
    *tmp++ = *src++;
   805e6:       4bde                    move.b [$r11+],$r13
   805e8:       6cc6                    test.d $r12
   805ea:       0830                    beq 805f4
   805ec:       c9df                    move.b $r13,[$r9+]
   805ee:       8b0b                    test.b [$r11]
   805f0:       f320                    bne 805e4
   805f2:       0f05                    nop 
  }

And here is my first while loop:

  while (count)
   8062c:       6cc6                    test.d $r12
   8062e:       1030                    beq 80640
   80630:       6a96                    move.d $r10,$r9
  {
    count--;
    if (!(*tmp++ = *src++))
   80632:       4bde                    move.b [$r11+],$r13
   80634:       c9db                    move.b $r13,[$r9]
   80636:       890f                    test.b [$r9+]
   80638:       0630                    beq 80640
   8063a:       81c2                    subq 1,$r12
      break;
   8063c:       f520                    bne 80632
   8063e:       0f05                    nop 
  }

> > Also note that your version
> > of the second loop needs an explicit  comparison with -1,
> > whereas mine uses an implicit comparison with 0.
> 
> I don't understand why you say I need an explicit comparison 
> with -1. My first loop exits either with the number of bytes 
> remaining in the buffer or with zero if it's copied count 
> number of bytes.

I was talking about the second loop. The object code the compiler
produces for your version actually tests the count variable after
it decreases it, which is why it tests for -1.

> The second loop WOULD require a comparison with -1 IF the 
> "count--" were not inside of the loop body.  As it IS in the 
> loop body, there is no need for that.  My second loop has an 
> implicit comparison against zero.

Hmm, your second loop from above looks like:

	while (n--) {
		*s++ = 0;
	}

whereas mine looks like:

 	while (count) {
 		*tmp++ = '\0';
 		count--;
 	}

You seem to be referring to my version, where what you say is true.

> > Testing on the CRIS architecture, your version is 24 instructions,
> > whereas mine is 18. For comparison, Eric's one is 12 and the
> > currently used implementation is 26 (when corrected for the
> > off-by-one error by comparing with > 1 rather than != 0 in the
> > second loop).
> 
> If I understand it correctly, the corrected original needs 
> that explicit comparison because the decrement is in the loop 
> conditional.  My implementation (and yours) corrects this by 
> moving the decrement into the body of the loop.
> 
> Also, while instruction count is sometimes a good indication of 
> algorithm efficiency, I would argue that our two-loop version is 
> probably the same speed as the single-loop version for copying 
> characters, but ours is faster for zeroing the rest of the 
> target buffer.

Seems to be the case, yes.

> > Here is another version that I think is quite pleasing
> > aesthetically (YMMV) since the loops are so similar (it is 21
> > instructions on CRIS):
> > 
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > 	char *tmp = dest;
> > 
> > 	for (; count && *src; count--)
> > 		*tmp++ = *src++;
> > 
> > 	for (; count; count--)
> > 		*tmp++ = '\0';
> > 
> > 	return dest;
> > }
> 
> I agree that this is definately a more elegant look to the 
> code, and I would prefer what you have done here.  But what 
> puzzles me is that this is functionally and logically 
> equivalent to my code.
> 
> So, this code:
> 
> for (A; B; C) {}
> 
> is the same as this:
> 
> A;
> while (B) {
> 	...
> 	C;
> }
> 
> 
> So why is it that this mere syntactic difference causes the 
> compiler to produce a better result?

I wish I new. Actually in the CRIS case, it seems to be an
optimizer thing. If I change your first loop from

	while (n && *s2) {
		n--;
		*s++ = *s2++;
	}

to

	while (n && *s2) {
		*s++ = *s2++;
		n--;
	}

it gives the expected object code, i.e., the same as my
first for loop. So here is a modified version of your
code that gives exactly the same object code (for CRIS) 
as my version with the for loops:

char *strncpy(char * s1, const char * s2, size_t n)
{
	register char *s = s1;

	while (n && *s2) {
		*s++ = *s2++;
		n--;
	}
	while (n) {
		*s++ = 0;
		n--;
	}
	return s1;
}

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* RE: generic strncpy - off-by-one error
@ 2003-08-14  9:34 Peter Kjellerstedt
  2003-08-14 19:45 ` Willy Tarreau
  2003-08-14 20:24 ` Timothy Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Peter Kjellerstedt @ 2003-08-14  9:34 UTC (permalink / raw)
  To: 'Timothy Miller', linux-kernel mailing list

[ Trimmed the recipient list. ]

> -----Original Message-----
> From: Timothy Miller [mailto:miller@techsource.com] 
> Sent: Wednesday, August 13, 2003 21:04
> To: andersen@codepoet.org
> Cc: Albert Cahalan; linux-kernel mailing list; 
> bernd@firmix.at; Anthony.Truong@mascorp.com; 
> alan@lxorguk.ukuu.org.uk; schwab@suse.de; 
> ysato@users.sourceforge.jp; willy@w.ods.org; 
> Valdis.Kletnieks@vt.edu; william.gallafent@virgin.net
> Subject: Re: generic strncpy - off-by-one error
> 
> Erik Andersen wrote:
> 
> > char *strncpy(char * s1, const char * s2, size_t n)
> > {
> >     register char *s = s1;
> >     while (n) {
> > 	if ((*s = *s2) != 0) s2++;
> > 	++s;
> > 	--n;
> >     }
> >     return s1;
> > }
> 
> 
> 
> Nice!

I can but agree.

> How about this:
> 
> 
> char *strncpy(char * s1, const char * s2, size_t n)
> {
> 	register char *s = s1;
> 
> 	while (n && *s2) {
> 		n--;
> 		*s++ = *s2++;
> 	}
> 	while (n--) {
> 		*s++ = 0;
> 	}
> 	return s1;
> }

This may be improved further:

char *strncpy(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	while (count) {
		if (*src == '\0')
			break;

		*tmp++ = *src++;
		count--;
	}

	while (count) {
		*tmp++ = '\0';
		count--;
	}

	return dest;
}

Moving the check for *src == '\0' into the first loop seems
to let the compiler reuse the object code a little better
(may depend on the optimizer). Also note that your version
of the second loop needs an explicit  comparison with -1,
whereas mine uses an implicit comparison with 0.

Testing on the CRIS architecture, your version is 24 instructions,
whereas mine is 18. For comparison, Eric's one is 12 and the
currently used implementation is 26 (when corrected for the
off-by-one error by comparing with > 1 rather than != 0 in the
second loop).

Here is another version that I think is quite pleasing
aesthetically (YMMV) since the loops are so similar (it is 21
instructions on CRIS):

char *strncpy(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	for (; count && *src; count--)
		*tmp++ = *src++;

	for (; count; count--)
		*tmp++ = '\0';

	return dest;
}

This is probably the version I would choose if I were to decide
as the object code generated for the actual loops are optimal in
this version (at least for CRIS).

> This reminds me a lot of the ORIGINAL, although I didn't pay much 
> attention to it at the time, so I don't remember.  It may be that
> the original had "n--" in the while () condition of the first 
> loop, rather than inside the loop.
> 
> I THINK the original complaint was that n would be off by 1 
> upon exiting the first loop.  The fix is to only decrement n 
> when n is nonzero.
> 
> If s2 is short enough, then we'll exit the first loop on the nul byte 
> and fill in the rest in the second loop.  Since n is only decremented 
> with we actually write to s, we will only ever write n bytes.  No 
> off-by-one.
> 
> If s2 is too long, the first loop will exit on n being zero, 
> and since it doesn't get decremented in that case, it'll be 
> zero upon entering the second loop, thus bypassing it properly.
> 
> Erik's code is actually quite elegant, and its efficiency is probably 
> essentially the same as my first loop.  But my second loop would 
> probably be faster at doing the zero fill.
> 
> 
> Now, consider this for the second loop!
> 
> 	while (n&3) {

I think sizeof(int)-1 would be better than 3. ;)
And long would probably be better for the 64-bit architectures?

> 		*s++ = 0;
> 		n--;
> 	}
> 	l = n>>2;
> 	while (l--) {
> 		*((int *)s)++ = 0;
> 	}
> 	n &= 3;
> 	while (n--) {
> 		*s++ = 0;
> 	}
> 
> This is only a win for relatively long nul padding.  How often is the 
> padding long enough?

I guess another way would be to replace the second loop with
memset(s, '\0', n), but that would probably only be a win for
quite long paddings.

//Peter

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: generic strncpy - off-by-one error
@ 2003-08-13  3:09 Anthony Truong
  0 siblings, 0 replies; 37+ messages in thread
From: Anthony Truong @ 2003-08-13  3:09 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Albert Cahalan, andersen, linux-kernel mailing list, bernd, alan,
	schwab, ysato, Valdis.Kletnieks, william.gallafent

On Wed, 2003-08-13 at 13:18, Willy Tarreau wrote:

On Tue, Aug 12, 2003 at 11:38:31PM -0400, Albert Cahalan wrote:
> That's excellent. On ppc I count 12 instructions,
> 4 of which would go away for typical usage if inlined.
> Annoyingly, gcc doesn't get the same assembly from my
> attempt at that general idea:
> 
> char * strncpy_5(char *dest, const char *src, size_t count){
>   char *tmp = dest;
>   while (count--){
>     if(( *tmp++ = *src )) src++;
>   }
>   return dest;
> }
> 
> I suppose that gcc could use a bug report.

I often noticed that using '++' and '--' within or just before
assignments
and/or comparisons often break the code and make it suboptimal. C
provides
enough flexibility to code what you think nearly at the instruction
level.
Since 'while' loops often start with a jump to the end, you can
sometimes help
the compiler by enclosing them within an 'if' statement such as below.
BTW, in
your case, count ends with -1.

I've absolutely not tried this one, but it could produce different code
on your
PPC, and can trivially be derived to cleaner constructs. I proceeded the
same
way when I wrote my own optimized strlcpy() implementation which is 45
bytes
long and copies 1 char per CPU cycle on i686.

char *strncpy(char *dest, const char *src, size_t count)
{
   if (count) {
      char *tmp = dest;
      while (1) {
         *tmp = *src;
         if (*src) src++;
         tmp++;
         if (!count--) break;
      }
   }
   return dest;
}

Cheers,
Willy


Hello,
This reminds me of my days in school learning how to program in C/C++. 
I first learnt that in
  while (count-- && (*dest++ = *src++));
  the first operand of && will be evaluated first, and then the second
  iff the first is evaluated to TRUE.
  count-- : count will be checked to see if it is not 0 first, and if it
  is not, it will get decremented.  If it is, the while loop is ended.
  Therefore, how could count go to -1?
If there is a bug in the gcc, then should we fix it rather than messing 
with our coding trying to please it?
We don't have to do if (count) before the loop because a count of 0 is 
already caught in while (count-- ......).

Again, maybe what I learnt in school about C/C++ programming is all
wrong.:-)

:-)
Anthony Dominic Truong.




Disclaimer: The information contained in this transmission, including any
attachments, may contain confidential information of Matsushita Avionics
Systems Corporation.  This transmission is intended only for the use of the
addressee(s) listed above.  Unauthorized review, dissemination or other use
of the information contained in this transmission is strictly prohibited.
If you have received this transmission in error or have reason to believe
you are not authorized to receive it, please notify the sender by return
email and promptly delete the transmission.



^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: generic strncpy - off-by-one error
@ 2003-08-13  2:18 Albert Cahalan
  2003-08-13  2:47 ` Erik Andersen
  0 siblings, 1 reply; 37+ messages in thread
From: Albert Cahalan @ 2003-08-13  2:18 UTC (permalink / raw)
  To: linux-kernel mailing list
  Cc: bernd, Anthony.Truong, alan, schwab, ysato, willy,
	Valdis.Kletnieks, william.gallafent

You're all wrong. This is some kind of programming
test for sure!

Let us imagine that glibc has a correct version.
By exhaustive testing, I found a version that works.
Here it is, along with the test code:

//////////////////////////////////////////////////////
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

// first correct implementation!
char * strncpy_good(char *dest, const char *src, size_t count){
  char *tmp = dest;
  memset(dest,'\0',count);
  while (count-- && (*tmp++ = *src++))
    ;
  return dest;
}

static char ref[32];
static char hmm[32];

static char s0[] = "";
static char s1[] = "1";
static char s2[] = "12";
static char s6[] = "123456";
static char s7[] = "1234567";
static char s8[] = "12345678";
static char s9[] = "123456789";

static void tester(const char *src, size_t count){
  memset(ref,'%',sizeof ref);
  memset(hmm,'%',sizeof hmm);
  strncpy     (ref+2,src,count);
  strncpy_good(hmm+2,src,count);
  if(memcmp(ref,hmm,sizeof hmm)){
    printf("error @ size %d\n",(int)count);
  }
}  
   
int main(int argc, char *argv[]){
  int i = 15;
  while(i--){
    tester(s0,i);
    tester(s1,i);
    tester(s2,i);
    tester(s6,i);
    tester(s7,i);
    tester(s8,i);
    tester(s9,i);
  }
  return 0;
}
////////////////////////////////////////////////////////////



^ permalink raw reply	[flat|nested] 37+ messages in thread
* generic strncpy - off-by-one error
@ 2003-08-12 14:07 Yoshinori Sato
  2003-08-12 14:39 ` Willy Tarreau
  2003-08-12 14:50 ` Yoshinori Sato
  0 siblings, 2 replies; 37+ messages in thread
From: Yoshinori Sato @ 2003-08-12 14:07 UTC (permalink / raw)
  To: linux kernel Mailing List

zero fill count is off-by-one error

--- lib/string.c~	2003-08-09 20:30:36.000000000 +0900
+++ lib/string.c	2003-08-12 22:55:47.000000000 +0900
@@ -89,7 +89,8 @@
 
 	while (count && (*dest++ = *src++) != '\0')
 		count--;
-	while (count) {
+	count--;
+	while (count > 0) {
 		*dest++ = 0;
 		count--;
 	}

-- 
Yoshinori Sato
<ysato@users.sourceforge.jp>

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: generic strncpy - off-by-one error
@ 2003-08-12  1:56 Anthony Truong
  2003-08-12 17:14 ` Alan Cox
  0 siblings, 1 reply; 37+ messages in thread
From: Anthony Truong @ 2003-08-12  1:56 UTC (permalink / raw)
  To: Bernd Petrovitsch; +Cc: linux kernel Mailing List

On Wed, 2003-08-13 at 00:24, Bernd Petrovitsch wrote:

>I don't see any problem with this code, and if we don't need to
>NULL-pad the dest string, we do not have to.  It is not in the
>definition of strncpy(). 

Wrong: http://www.opengroup.org/onlinepubs/007908799/xsh/strncpy.html

        Bernd

Hello,
Thanks for pointing that out to me.  However, I don't think the kernel
strncpy implementation is exactly the same as that of Standard C lib
implementation.  Iwas just looking at it from the kernel code context. 
There's a point in doing it the "kernel" way, to save precious CPU
cycles from being wasted otherwise.

Regards,
Anthony Dominic Truong.



Disclaimer: The information contained in this transmission, including any
attachments, may contain confidential information of Matsushita Avionics
Systems Corporation.  This transmission is intended only for the use of the
addressee(s) listed above.  Unauthorized review, dissemination or other use
of the information contained in this transmission is strictly prohibited.
If you have received this transmission in error or have reason to believe
you are not authorized to receive it, please notify the sender by return
email and promptly delete the transmission.



^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: generic strncpy - off-by-one error
@ 2003-08-12  1:28 Anthony Truong
  2003-08-12 16:24 ` Bernd Petrovitsch
  0 siblings, 1 reply; 37+ messages in thread
From: Anthony Truong @ 2003-08-12  1:28 UTC (permalink / raw)
  To: William Gallafent
  Cc: Valdis.Kletnieks, Yoshinori Sato, linux kernel Mailing List

On Tue, 2003-08-12 at 23:54, William Gallafent wrote:

On Tue, 12 Aug 2003 Valdis.Kletnieks@vt.edu wrote:

> On Tue, 12 Aug 2003 23:50:06 +0900, Yoshinori Sato
> <ysato@users.sourceforge.jp> said:
> > -   while (count) {
> > +   while (count > 1) {
>
> Given that count is a size_t, which seems to be derived from 'unsigned
int'
> or 'unsigned long' on every platform, how are these any different?

Er, consider the case of count == 1. Fenceposts can be dangerous things.

-- 
Bill Gallafent.
-


Hello,
This is the code I got from 2.4.20:
char * strncpy(char * dest,const char *src,size_t count)
{
	char *tmp = dest;

	while (count-- && (*dest++ = *src++) != '\0')
		/* nothing */;

	return tmp;
}

I don't see any problem with this code, and if we don't need to NULL-pad
the dest string, we do not have to.  It is not in the definition of
strncpy().  So we don't need the second while {};
I'm hoping we're looking at the same thing.

Regards,
Anthony Dominic Truong.




Disclaimer: The information contained in this transmission, including any
attachments, may contain confidential information of Matsushita Avionics
Systems Corporation.  This transmission is intended only for the use of the
addressee(s) listed above.  Unauthorized review, dissemination or other use
of the information contained in this transmission is strictly prohibited.
If you have received this transmission in error or have reason to believe
you are not authorized to receive it, please notify the sender by return
email and promptly delete the transmission.



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2003-08-20  7:48 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-08-20  7:43 generic strncpy - off-by-one error Peter Kjellerstedt
  -- strict thread matches above, loose matches on Subject: below --
2003-08-16 21:10 Peter Kjellerstedt
2003-08-18 18:41 ` Timothy Miller
2003-08-16 20:08 Peter Kjellerstedt
2003-08-16  9:19 Peter Kjellerstedt
2003-08-16 10:04 ` Daniel Forrest
2003-08-18 16:40   ` Timothy Miller
2003-08-16  8:15 Peter Kjellerstedt
2003-08-16  8:41 ` Daniel Forrest
2003-08-18 16:17   ` Timothy Miller
2003-08-18 16:06 ` Timothy Miller
2003-08-15  9:54 Peter Kjellerstedt
2003-08-15 17:52 ` Timothy Miller
2003-08-15  9:53 Peter Kjellerstedt
2003-08-15 17:47 ` Timothy Miller
2003-08-14  9:34 Peter Kjellerstedt
2003-08-14 19:45 ` Willy Tarreau
2003-08-14 20:24 ` Timothy Miller
2003-08-13  3:09 Anthony Truong
2003-08-13  2:18 Albert Cahalan
2003-08-13  2:47 ` Erik Andersen
2003-08-13  3:38   ` Albert Cahalan
2003-08-13  3:56     ` Nick Piggin
2003-08-13  5:18     ` Willy Tarreau
2003-08-13 19:03   ` Timothy Miller
2003-08-12 14:07 Yoshinori Sato
2003-08-12 14:39 ` Willy Tarreau
2003-08-12 14:50 ` Yoshinori Sato
2003-08-12 15:03   ` Valdis.Kletnieks
2003-08-12 15:54     ` William Gallafent
2003-08-12 16:19       ` Valdis.Kletnieks
2003-08-12 16:09     ` Andreas Schwab
2003-08-12  1:56 Anthony Truong
2003-08-12 17:14 ` Alan Cox
2003-08-12 21:53   ` Bernd Petrovitsch
2003-08-12  1:28 Anthony Truong
2003-08-12 16:24 ` Bernd Petrovitsch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).