linux-kernel.vger.kernel.org archive mirror
* hmm..
@ 2003-12-22 20:10 John Dee
  2003-12-22 21:31 ` hmm Linus Torvalds
  2003-12-22 23:12 ` hmm Gene Heskett
  0 siblings, 2 replies; 8+ messages in thread
From: John Dee @ 2003-12-22 20:10 UTC (permalink / raw)
  To: linux-kernel

I know you guys have already probably seen this.. figured I'd share with 
the class, so the big kids can tear it apart.
http://lwn.net/Articles/64052/
enjoy.


* Re: hmm..
  2003-12-22 20:10 hmm John Dee
@ 2003-12-22 21:31 ` Linus Torvalds
  2003-12-22 23:10   ` hmm bert hubert
                     ` (2 more replies)
  2003-12-22 23:12 ` hmm Gene Heskett
  1 sibling, 3 replies; 8+ messages in thread
From: Linus Torvalds @ 2003-12-22 21:31 UTC (permalink / raw)
  To: John Dee; +Cc: linux-kernel



On Mon, 22 Dec 2003, John Dee wrote:
>
> I know you guys have already probably seen this.. figured I'd share with 
> the class, so the big kids can tear it apart.
> http://lwn.net/Articles/64052/

I spent half an hour tearing part of it apart for some journalists. No
guarantees for the full accuracy of this write-up, and in particular I
don't actually have "original UNIX" code to compare against, but the files
I checked (ctype.[ch]) definitely do not have any UNIX history to them.

The rest of the files are mostly errno.h/signal.h/ioctl.h (and they are 
apparently the 2.4.x versions, before we moved some common constants into 
"asm-generic/errno.h"), and while I haven't analyzed them, I know for a 
fact that

 - the original errno.h used different error numbers than "original UNIX"

   I know this because I cursed it later when it meant that doing things 
   like binary emulation wasn't as trivial - you had to translate the 
   error numbers.

 - same goes for "signal.h": while a lot of the standard signals are well 
   documented (ie "SIGKILL is 9"), historically we had lots of confusion 
   (ie I think "real UNIX" has SIGBUS at 10, while Linux didn't originally 
   have any SIGBUS at all, and later put it at 7 which was originally 
   SIGUNUSED).

So to me it looks like 

 - yes, Linux obviously has the same signal names and error number names 
   that UNIX has (so the files certainly have a lot of the same 
   identifiers)

 - but equally clearly they weren't copied from any "real UNIX". 

(Later, non-x86 architectures have tried harder to be binary-compatible 
with their "real UNIX" counter-parts, and as a result we have different 
errno header files for different architectures - and on non-x86 
architectures the numbers will usually match traditional UNIX).

For example, doing a "grep" for SIGBUS on the kernel shows that most
architectures still have SIGBUS at 7 (original Linux value), while alpha,
sparc, parisc and mips have it at 10 (to match "real UNIX").

What this tells me is that the original code never came from UNIX, but
some architectures later were made to use the same values as UNIX for
binary compatibility (I know this is true for alpha, for example: being
compatible with OSF/1 was one of my very early goals in that port).

In other words, I think we can totally _demolish_ the SCO claim that these 
65 files were somehow "copied". They clearly are not.

Which should come as no surprise to people. But I think it's nice to see 
just _how_ clearly we can show that SCO is - yet again - totally 
incorrect.

		Linus

----

For example, SCO lists the files "include/linux/ctype.h" and
"lib/ctype.h", and some trivial digging shows that those files are
actually there in the original 0.01 distribution of Linux (ie September of
1991). And I can state 

 - I wrote them (and looking at the original ones, I'm a bit ashamed: 
   the "toupper()" and "tolower()" macros are so horribly ugly that I 
   wouldn't admit to writing them if it wasn't because somebody else 
   claimed to have done so ;)

 - writing them is no more than five minutes of work (you can verify that 
   with any C programmer, so you don't have to take my word for it)

 - the details in them aren't even the same as in the BSD/UNIX files (the 
   approach is the same, but if you look at actual implementation details 
   you will notice that it's not just that my original "tolower/toupper"  
   were embarrassingly ugly, a number of other details differ too).

In short: for the files where I personally checked the history, I can
definitely say that those files are trivially written by me personally,
with no copying from any UNIX code _ever_.

So it's definitely not a question of "all derivative branches". It's a
question of the fact that I can show (and SCO should have been able to
see) that the list they show clearly shows original work, not "copied".


	Analysis of "lib/ctype.c" and "include/linux/ctype.h".


First, some background: the "ctype" name comes from "character type", and the
whole point of "ctype.h" and "ctype.c" is to test what kind of character
we're dealing with. In other words, those files implement tests for doing
things like asking "is this character a digit" or "is this character an
uppercase letter" etc. So you can write things like

	if (isdigit(c)) {
		.. we do something with the digit ..

and the ctype files implement that logic.

Those files exist (in very similar form) in the original Linux-0.01 
release under the names "lib/ctype.c" and "include/ctype.h". That kernel 
was released in September of 1991, and contains no code except for mine 
(and Lars Wirzenius, who co-wrote "kernel/vsprintf.c").

In fact, you can look at the files today and 12 years ago, and you can see 
clearly that they are largely the same: the modern files have been cleaned 
up and fix a number of really ugly things (tolower/toupper works 
properly), but they are clearly incremental improvements on the original
ones.

And the original one does NOT look like the unix source one. It has 
several similarities, but they are clearly due to:

 - the "ctype" interfaces are defined by the C standard library.

 - the C standard also specifies what kinds of names a system library 
   interface can use internally. In particular, the C standard specifies 
   that names that start with an underscore and a capital letter are 
   "internal" to the library. This is important, because it explains why
   both the Linux implementation _and_ the UNIX implementation used a
   particular naming scheme for the flags.

 - algorithmically, there aren't that many ways to test whether a 
   character is a number or not. That's _especially_ true in
   C, where a macro must not use its argument more than once. So for 
   example, the "obvious" implementation of "isdigit()" (which tests for 
   whether a character is a digit or not) would be

	#define isdigit(x) ((x) >= '0' && (x) <= '9')

   but this is not actually allowed by the C standard (because 'x' is used 
   twice).

   This explains why both Linux and traditional UNIX use the "other" 
   obvious implementation: having an array that describes what each of the 
   possible 256 characters are, and testing the contents of that array
   (indexed by the character) instead. That way the macro argument is only 
   used once.

The above things basically explain the similarities. There simply aren't
that many ways to do a standard C "ctype" implementation, in other words.

Now, let's look at the _differences_ in Linux and traditional UNIX:

 - both Linux and traditional unix use a naming scheme of "underscore and 
   a capital letter" for the flag names. There are flags for "is upper 
   case" (_U) and "is lower case" (_L), and surprise surprise, both UNIX 
   and Linux use the same name. But think about it - if you wanted to use 
   a short flag name, and you were limited by the C standard naming, what 
   names _would_ you use? Maybe you'd select "U" for "Upper case" and "L" 
   for "Lower case"?

   Looking at the other flags, Linux uses "_D" for "Digit", while
   traditional UNIX instead uses "_N" for "Number". Both make sense, but 
   they are different. I personally think that the Linux naming makes more 
   sense (the function that tests for a digit is called "isdigit()", not
   "isnumber()"), but on the other hand I can certainly understand why 
   UNIX uses "_N" - the function that checks for whether a character is 
   "alphanumeric" is called "isalnum()", and that checks whether the 
   character is an upper-case letter, a lower-case letter _or_ a digit (aka 
   "number").

   In short: there aren't that many ways you can choose the names, and 
   there is lots of overlap, but it's clearly not 100%.

 - The original Linux ctype.h/ctype.c file has obvious deficiencies, which 
   pretty much point to somebody new to C making mistakes (me) rather than 
   any old and respected source. For example, the "toupper()/tolower()"  
   macros are just totally broken, and nobody would write the "isascii()" 
   and "toascii()" the way they were written in that original Linux. And
   you can see that they got fixed later on in Linux development, even 
   though you can also see that the files otherwise didn't change.

   For example: remember how C macros must only use their argument once 
   (never mind why - you really don't care, so just take it on faith, for
   now). So let's say that you wanted to change an upper case character 
   into a lower case one, which is what "tolower()" does. Normal use is 
   just a fairly obvious

	newchar = tolower(oldchar);

   and the original Linux code does

	extern char _ctmp;
	#define tolower(c) (_ctmp=c,isupper(_ctmp)?_ctmp+('a'+'A'):_ctmp)

   which is not very pretty, but notice how we have a "temporary 
   character" _ctmp (remember that internal header names should start with
   an underscore and an upper case character - this is already slightly 
   broken in itself). That's there so that we can use the argument "c" 
   only once - to assign it to the new temporary - and then later on we 
   use that temporary several times.

   Now, the reason this is broken is 

    - it's not thread-safe (if two different threads try to do this at 
      once, they will stomp on each other's temporary variable)

    - the argument (c) might be a complex expression, and as such it
      should really be parenthesized. The above gets several valid 
      (but unusual) expressions wrong.

Basically, the above is _exactly_ the kinds of mistakes a young programmer 
would make. It's classic.

And I bet it's _not_ what the UNIX code looked like, even in 1991. UNIX by
then was 20 years old, and I _think_ that it uses a simple table lookup
(which makes a lot more sense anyway and solves all problems). I'd be very
surprised if it had those kinds of "beginner mistakes" in it, but I don't 
actually have access to the code, so what do I know? (I can look up some 
BSD code on the web, it definitely does _not_ do anything like the above).

The lack of proper parentheses exists in other places of the original 
Linux ctype.h file too: isascii() and toascii() are similarly broken.

In other words: there are _lots_ of indications that the code was not 
copied, but was written from scratch. Bugs and all.

Oh, another detail: try searching the web (google is your friend) for 
"_ctmp". It's unique enough that you'll notice that all the returned hits 
are all Linux-related. No UNIX hits anywhere. Doing a google for

	_ctmp -linux

shows more Linux pages (that just don't happen to have "linux" in them),
except for one which is the L4 microkernel, and that one shows that they
used the Linux header file (it still says "_LINUX_CTYPE_H" in it).

So there is definitely a lot of proof that my ctype.h is original work.



* Re: hmm..
  2003-12-22 21:31 ` hmm Linus Torvalds
@ 2003-12-22 23:10   ` bert hubert
  2003-12-23  1:16   ` hmm Felipe Alfaro Solana
  2003-12-25  7:56   ` hmm Valdis.Kletnieks
  2 siblings, 0 replies; 8+ messages in thread
From: bert hubert @ 2003-12-22 23:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: John Dee, linux-kernel

On Mon, Dec 22, 2003 at 01:31:51PM -0800, Linus Torvalds wrote:

>  - yes, Linux obviously has the same signal names and error number names 
>    that UNIX has (so the files certainly have a lot of the same 
>    identifiers)

Even windows errno numbers often match the unix ones, btw.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: hmm..
  2003-12-22 20:10 hmm John Dee
  2003-12-22 21:31 ` hmm Linus Torvalds
@ 2003-12-22 23:12 ` Gene Heskett
  1 sibling, 0 replies; 8+ messages in thread
From: Gene Heskett @ 2003-12-22 23:12 UTC (permalink / raw)
  To: John Dee, linux-kernel

On Monday 22 December 2003 15:10, John Dee wrote:
>I know you guys have already probably seen this.. figured I'd share
> with the class, so the big kids can tear it apart.
>http://lwn.net/Articles/64052/
>enjoy.
>-

I checked several of those files as far back as 2.4.18, the oldest 
that still exists on my machine.

None of those files I checked contain any credits to anybody.  And, 
noting the last message wherein the poster quoted contrary statements 
made by SCO many years ago, I'd make the assumption that any judge 
capable of common sense would toss this, with prejudice, meaning they 
cannot change a couple of words and refile the suit.  Or at least 
that's how I understand it.  OTOH, I'm not the judge.

-- 
Cheers, Gene
AMD K6-III@500mhz 320M
Athlon1600XP@1400mhz  512M
99.22% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attornies please note, additions to this message
by Gene Heskett are:
Copyright 2003 by Maurice Eugene Heskett, all rights reserved.



* Re: hmm..
  2003-12-22 21:31 ` hmm Linus Torvalds
  2003-12-22 23:10   ` hmm bert hubert
@ 2003-12-23  1:16   ` Felipe Alfaro Solana
  2003-12-25  7:56   ` hmm Valdis.Kletnieks
  2 siblings, 0 replies; 8+ messages in thread
From: Felipe Alfaro Solana @ 2003-12-23  1:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: John Dee, Linux Kernel Mailinglist

On Mon, 2003-12-22 at 22:31, Linus Torvalds wrote:

> In other words, I think we can totally _demolish_ the SCO claim that these 
> 65 files were somehow "copied". They clearly are not.

It seems they keep Smoking Crack (TM)...



* Re: hmm..
  2003-12-22 21:31 ` hmm Linus Torvalds
  2003-12-22 23:10   ` hmm bert hubert
  2003-12-23  1:16   ` hmm Felipe Alfaro Solana
@ 2003-12-25  7:56   ` Valdis.Kletnieks
  2003-12-31  6:28     ` hmm Linus Torvalds
  2003-12-31 13:44     ` hmm viro
  2 siblings, 2 replies; 8+ messages in thread
From: Valdis.Kletnieks @ 2003-12-25  7:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


On Mon, 22 Dec 2003 13:31:51 PST, Linus Torvalds said:

>  - algorithmically, there aren't that many ways to test whether a 
>    character is a number or not. That's _especially_ true in
>    C, where a macro must not use its argument more than once. So for 
>    example, the "obvious" implementation of "isdigit()" (which tests for 
>    whether a character is a digit or not) would be
> 
> 	#define isdigit(x) ((x) >= '0' && (x) <= '9')
> 
>    but this is not actually allowed by the C standard (because 'x' is used 
>    twice).

Somebody tell IBM that.  From the AIX 4.3.3 and 5.1 /usr/include/ctype.h:

#define _VALC(__c)              ((__c)>=0&&(__c)<=256)
#define _IS(__c,__m)            (__OBJ_DATA(__lc_ctype)->mask[__c] & __m)
#define isalpha(__a)    (_VALC(__a)?_IS(__a,_ISALPHA):0)
#define isalnum(__a)    (_VALC(__a)?_IS(__a,_ISALNUM):0)
#define iscntrl(__a)    (_VALC(__a)?_IS(__a,_ISCNTRL):0)
#define isdigit(__a)    (_VALC(__a)?_IS(__a,_ISDIGIT):0)
#define isgraph(__a)    (_VALC(__a)?_IS(__a,_ISGRAPH):0)
#define islower(__a)    (_VALC(__a)?_IS(__a,_ISLOWER):0)
#define isprint(__a)    (_VALC(__a)?_IS(__a,_ISPRINT):0)
#define ispunct(__a)    (_VALC(__a)?_IS(__a,_ISPUNCT):0)
#define isspace(__a)    (_VALC(__a)?_IS(__a,_ISSPACE):0)
#define isupper(__a)    (_VALC(__a)?_IS(__a,_ISUPPER):0)
#define isxdigit(__a)   (_VALC(__a)?_IS(__a,_ISXDIGIT):0)
#define isascii(c)      (!((c) & ~0177))

You'd be *amazed* how far through memory a 'while (isalpha(*s++)) {..};' can go
(which in fact is how I discovered this blecherousness).

The AIX 4.3 support I contributed to  Sendmail 8.9.0 back in Feb 98 included a
work-around because IBM refused to fix it on the grounds that the VALC macro
was to protect against a SEGV if the macro was fed an 'int' rather than a
'char' (why they didn't just use 'mask[__c & 255]' is beyond me), and that you
only got hit if you compiled(*) with -D_ILS_MACROS.  At least IBM eventually fixed
isascii(), which was originally broken the same way.... 

Feel free to file this under "Code we can prove that IBM never contributed" :)

(*) The default is to use actual function calls due to locale considerations - building
with _ILS_MACROS provides a measured 30%+ CPU savings for Sendmail, which
doesn't care if it's nailed into a 'LANG=C' environ anyhow...



* Re: hmm..
  2003-12-25  7:56   ` hmm Valdis.Kletnieks
@ 2003-12-31  6:28     ` Linus Torvalds
  2003-12-31 13:44     ` hmm viro
  1 sibling, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2003-12-31  6:28 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel



On Thu, 25 Dec 2003 Valdis.Kletnieks@vt.edu wrote:
> 
> Somebody tell IBM that.  From the AIX 4.3.3 and 5.1 /usr/include/ctype.h:

Wow. 

And I thought _my_ code was crap. 

You have to be professional to mess up quite _that_ badly ;)

		Linus


* Re: hmm..
  2003-12-25  7:56   ` hmm Valdis.Kletnieks
  2003-12-31  6:28     ` hmm Linus Torvalds
@ 2003-12-31 13:44     ` viro
  1 sibling, 0 replies; 8+ messages in thread
From: viro @ 2003-12-31 13:44 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Linus Torvalds, linux-kernel

On Thu, Dec 25, 2003 at 02:56:37AM -0500, Valdis.Kletnieks@vt.edu wrote:
> work-around because IBM refused to fix it on the grounds that the VALC macro
> was to protect against a SEGV if the macro was fed an 'int' rather than a
> 'char' (why they didn't just use 'mask[__c & 255]' is beyond me), and that you

Err...

a) is...() must be able to deal with any value that fits into unsigned char
and with EOF.  Behaviour on anything else is undefined, so their argument
is obviously bogus.

b) mask[__c & 255] is _not_ a solution, simply because EOF and 255 might
have different properties.  Doesn't apply to kernel, but our is...()
do not bother with EOF at all.  Userland ones have to.

