> -----Original Message-----
> From: Willy Tarreau [mailto:willy@w.ods.org]
> Sent: Thursday, August 14, 2003 21:45
> To: Peter Kjellerstedt
> Cc: 'Timothy Miller'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> On Thu, Aug 14, 2003 at 11:34:50AM +0200, Peter Kjellerstedt wrote:
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> >         char *tmp = dest;
> >
> >         while (count) {
> >                 if (*src == '\0')
> >                         break;
> >
> >                 *tmp++ = *src++;
> >                 count--;
> >         }
> >
> >         while (count) {
> >                 *tmp++ = '\0';
> >                 count--;
> >         }
> >
> >         return dest;
> > }
> >
> > Moving the check for *src == '\0' into the first loop seems
> > to let the compiler reuse the object code a little better
> > (may depend on the optimizer). Also note that your version
> > of the second loop needs an explicit comparison with -1,
> > whereas mine uses an implicit comparison with 0.
>
> Well, if you're at this level of comparison, then I may object that
> 'count' is evaluated twice when jumping from one loop to the
> other one.
>
> *Algorithmically* speaking, the most optimal code should be:
>
> char *strncpy(char *dest, const char *src, size_t count)
> {
>         char *tmp = dest;
>         if (unlikely(!count))
>                 goto end;
> loop1:
>         if ((*tmp = *src++) == '\0')
>                 goto next;
>         tmp++;
>         if (likely(--count))
>                 goto loop1;
>         else
>                 goto end;
> loop2:
>         *tmp = '\0';
> next:
>         tmp++;
>         if (likely(--count))
>                 goto loop2;
> end:
>         return dest;
> }

I hope we do not opt to go with this version. :)

> (sorry for those who hate gotos). Look at it: no test is
> performed twice, no byte is read nor written twice in any
> case. The assembly code produced is optimal on x86 (well,
> in fact we could delete exactly one conditional jump because
> the compiler doesn't know how to reuse them for several tests).
> 16 inlinable instructions (= excluding function in/out) which
> can be executed 2 at a time if your CPU has 2 pipelines. About
> 3 cycles/copied byte, 2 cycles/filled byte.
It does not give optimal code for CRIS for the second loop. It can
easily be fixed by combining loop2 and next (which will cause tmp to
be assigned twice when changing loops), but I know this was not the
intent of this exercise.

> Unfortunately, it fools branch predictors and prefetch
> mechanisms found in today's processors, so it results in
> slower code than yours (at least on Athlon and P3). Perhaps
> it would be fast on older processors, I don't know.
>
> All that demonstrates that whatever your efforts in helping
> the optimizer, the only meaningful result is the benchmark.
> Number of bytes and speculations on the reusability of
> information between lines of code are not what makes our
> programs fast anymore :-(

It was easier some years ago, wasn't it? ;)

> > Testing on the CRIS architecture, your version is 24 instructions,
> > whereas mine is 18. For comparison, Eric's is 12 and the
> > currently used implementation is 26 (when corrected for the
> > off-by-one error by comparing with > 1 rather than != 0 in the
> > second loop).
>
> Just out of curiosity, can the CRIS architecture execute
> multiple instructions per cycle? Have you tried to determine

No, it cannot. :( Our ETRAX 100LX processor, which uses the CRIS
architecture, is a 100 MIPS processor without most of today's fancy
optimizations like multiple instruction issue, branch prediction,
etc. It is designed for embedded products.

> the dependencies between the instructions? Did you time the
> different functions? (Yours is rather fast on x86, BTW.)

No, I didn't, but I have done it now. I made a program that called
the respective strncpy() implementation a number of times, copying a
100 byte string into a 1000 byte buffer. For each function I tested
copying 50, 100, 200 and 1000 bytes. I have attached the source to
this mail if anyone wants to make sure I did not screw up the
benchmarking too much... ;)
This table shows the times in seconds to call each function
500,000 times on CRIS:

                      50        100       200       1000
Original           3.123906  6.138616  9.146204  33.323051
Eric's             2.602632  5.105367  9.637701  45.910798
Timothy's          3.125003  6.128571  9.147905  33.325337
My first           2.868396  5.626149  8.138134  28.276760
My second (for)    2.622412  5.129988  7.636364  27.786745
Algorithmic        2.597808  5.119332  7.627300  27.785999

Going by these numbers, Eric's version is a good candidate for the
copying phase, but loses out badly when it comes to the filling.
The best versions here seem to be my version with the for loops and
your algorithmic version, which give almost identical numbers (but
see below for more on the algorithmic version).

This table shows the times in seconds to call each function
20,000,000 times on my P4 2.53 GHz:

                      50        100       200       1000
Original           2.900444  5.475311   9.095701  37.462796
Eric's             2.988404  5.637352   9.576857  37.580639
Timothy's          2.929508  5.534824   9.157147  36.836899
My first           2.951068  5.511169   8.321381  29.669017
My second (for)    2.921675  5.531486   8.296728  28.940855
Algorithmic        3.911937  7.343235  12.699275  54.288284

The interesting thing here is that the original code wins the
copying phase. But the numbers for the copying phase for the first
five versions are so similar that I doubt one can say one is better
than another. However, when it comes to the filling phase, my
versions are back in the lead. These numbers also show very clearly
what you stated above about branch prediction, as the algorithmic
version loses out badly here.

All these numbers taken together, I would say that my version with
the for loops seems to be the overall winner. :)

> To conclude, I would say that if we want to implement really
> fast low-level primitives such as str*, mem*, ... there's
> nothing better than assembly associated with benchmarks on
> different CPU architectures and models.
>
> But I don't know if strncpy() is used enough to justify that...

I agree.

> Regards,
> Willy

//Peter