All of lore.kernel.org
 help / color / mirror / Atom feed
* [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
@ 2004-01-16 20:22 Timothy Miller
  2004-01-16 20:37 ` Valdis.Kletnieks
  0 siblings, 1 reply; 11+ messages in thread
From: Timothy Miller @ 2004-01-16 20:22 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Recently, I saw a slashdot article that pointed to a site dedicated to 
breaking MD5.  That is, so far, no one has found any two differing 
string which have the same MD5 cksum.  Logically, however, there WILL be 
collisions for any strings longer than the MD5 cksum itself -- we just 
haven't found any.  Well, there's some sort of contest where you can win 
money for breaking MD5 (I think).

Even further back, there was an LKML discussion about various sorts of 
compressing file systems.  One of the subthreads discussed identifying 
identical blocks (using MD5) and pointing them at the same physical 
block on disk.  Naturally, if there WERE two blocks with the same MD5, 
we'd want to check the raw data, just to be sure that there wasn't a 
false positive.

Think about it!  If we had a filesystem that actually DID this, and it 
was in the Linux kernel, it would spread far and wide.  It's bound to 
happen that someone will identify a collision.  We then report that to 
the committee offering the reward and then donate it to OSDL to help 
Linux development.



Yeah, I know... I'm a dork.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-16 20:22 [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL Timothy Miller
@ 2004-01-16 20:37 ` Valdis.Kletnieks
  2004-01-16 20:59   ` Timothy Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Valdis.Kletnieks @ 2004-01-16 20:37 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1173 bytes --]

On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <miller@techsource.com>  said:

> Think about it!  If we had a filesystem that actually DID this, and it 
> was in the Linux kernel, it would spread far and wide.  It's bound to 
> happen that someone will identify a collision.  We then report that to 
> the committee offering the reward and then donate it to OSDL to help 
> Linux development.

Actually, it's *not* "bound to happen".  Figure out the number of blocks you'd
need to have even a 1% chance of a birthday collision in a 2**128 space.

And you'd need that many disk blocks on *a single system*.

Then figure out the chances of a collision on a small machine that only has 20
or 30 terabytes (yes, in this case terabytes is small).

The whole reason "use MD5 as a check for identical blocks" is useful is because
the chances of *that* going wrong are vanishingly small compared to the chances
that a memory stick will throw an undetected multiple-bit error, or that a RAID
controller will write blocks to the wrong disks, or any number of other things
that *do* happen in real life, but rarely enough that we don't bother writing
code to defend against them.



[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-16 20:37 ` Valdis.Kletnieks
@ 2004-01-16 20:59   ` Timothy Miller
  2004-01-17 13:15     ` Bart Samwel
  0 siblings, 1 reply; 11+ messages in thread
From: Timothy Miller @ 2004-01-16 20:59 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Linux Kernel Mailing List



Valdis.Kletnieks@vt.edu wrote:
> On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <miller@techsource.com>  said:
> 
> 
>>Think about it!  If we had a filesystem that actually DID this, and it 
>>was in the Linux kernel, it would spread far and wide.  It's bound to 
>>happen that someone will identify a collision.  We then report that to 
>>the committee offering the reward and then donate it to OSDL to help 
>>Linux development.
> 
> 
> Actually, it's *not* "bound to happen".  Figure out the number of blocks you'd
> need to have even a 1% chance of a birthday collision in a 2**128 space.
> 
> And you'd need that many disk blocks on *a single system*.
> 
> Then figure out the chances of a collision on a small machine that only has 20
> or 30 terabytes (yes, in this case terabytes is small).

Certainly.  No one machine is going to find it in a reasonable period. 
OTOH, if a million machines were doing it, it increases the chances by 
just that much.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-16 20:59   ` Timothy Miller
@ 2004-01-17 13:15     ` Bart Samwel
  2004-01-20 19:21       ` Matthias Schniedermeyer
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Samwel @ 2004-01-17 13:15 UTC (permalink / raw)
  To: Timothy Miller, Valdis.Kletnieks; +Cc: Linux Kernel Mailing List

On Friday 16 January 2004 21:59, Timothy Miller wrote:
> Valdis.Kletnieks@vt.edu wrote:
> > On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <miller@techsource.com>  
said:
> >>Think about it!  If we had a filesystem that actually DID this, and it
> >>was in the Linux kernel, it would spread far and wide.  It's bound to
> >>happen that someone will identify a collision.  We then report that to
> >>the committee offering the reward and then donate it to OSDL to help
> >>Linux development.
> >
> > Actually, it's *not* "bound to happen".  Figure out the number of blocks
> > you'd need to have even a 1% chance of a birthday collision in a 2**128
> > space.
> >
> > And you'd need that many disk blocks on *a single system*.
> >
> > Then figure out the chances of a collision on a small machine that only
> > has 20 or 30 terabytes (yes, in this case terabytes is small).
>
> Certainly.  No one machine is going to find it in a reasonable period.
> OTOH, if a million machines were doing it, it increases the chances by
> just that much.

Let's take a look at the chances. 30 terabytes is, in a best-case scenario 
(with 512-byte blocks) about 6e10 blocks. That would be roughly 
6e10*6e10*(2**(-128)), or about 1e-17. With a hundred million machines, the 
chances of a collision would be about 1e-9, disregarding the fact that all 
these machines have a large chance of containing similar blocks -- their data 
isn't truly random, so some blocks have a larger chance of occurring than 
others. The data sets on the machines are probably reasonably static, so if 
the collision isn't found *at once* the chances of it occurring later are 
much smaller. So, even under the most positive assumptions, with a hundred 
million machines with 30 terabytes of storage each, it's extremely probable 
that you won't find a collision. (A 96-bit hash could have been broken with 
this setup however. :) )

-- Bart

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-17 13:15     ` Bart Samwel
@ 2004-01-20 19:21       ` Matthias Schniedermeyer
  2004-01-21 11:46         ` Bart Samwel
                           ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Matthias Schniedermeyer @ 2004-01-20 19:21 UTC (permalink / raw)
  To: Bart Samwel; +Cc: Timothy Miller, Valdis.Kletnieks, Linux Kernel Mailing List

On Sat, Jan 17, 2004 at 02:15:31PM +0100, Bart Samwel wrote:
> On Friday 16 January 2004 21:59, Timothy Miller wrote:
> > Valdis.Kletnieks@vt.edu wrote:
> > > On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <miller@techsource.com>  
> said:
> > >>Think about it!  If we had a filesystem that actually DID this, and it
> > >>was in the Linux kernel, it would spread far and wide.  It's bound to
> > >>happen that someone will identify a collision.  We then report that to
> > >>the committee offering the reward and then donate it to OSDL to help
> > >>Linux development.
> > >
> > > Actually, it's *not* "bound to happen".  Figure out the number of blocks
> > > you'd need to have even a 1% chance of a birthday collision in a 2**128
> > > space.
> > >
> > > And you'd need that many disk blocks on *a single system*.
> > >
> > > Then figure out the chances of a collision on a small machine that only
> > > has 20 or 30 terabytes (yes, in this case terabytes is small).
> >
> > Certainly.  No one machine is going to find it in a reasonable period.
> > OTOH, if a million machines were doing it, it increases the chances by
> > just that much.
> 
> Let's take a look at the chances. 30 terabytes is, in a best-case scenario 
> (with 512-byte blocks) about 6e10 blocks. That would be roughly 
> 6e10*6e10*(2**(-128)), or about 1e-17. With a hundred million machines, the 
> chances of a collision would be about 1e-9, disregarding the fact that all 
> these machines have a large chance of containing similar blocks -- their data 
> isn't truly random, so some blocks have a larger chance of occurring than 
> others. The data sets on the machines are probably reasonably static, so if 
> the collision isn't found *at once* the chances of it occurring later are 
> much smaller. So, even under the most positive assumptions, with a hundred 
> million machines with 30 terabytes of storage each, it's extremely probable 
> that you won't find a collision. (A 96-bit hash could have been broken with 
> this setup however. :) )

There is one fundemental braino in the discussion.

Only HALF the bits are for preventing "accidental" collisions. (The
"birthday" thing). The rest is for preventing to "brute force" an input
that produces the same MD5.(*)

So MD5 has only 2**64 Bits against accidental collsions
Btw. I already had (a/the) MD5 collision(*2) in my life.

So you'd need SHA256 or SHA512 to be "really sure(tm)".



*: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this is
also true for MD5.

*2: I had a direcory of about 1,5 Million images and "md5sum"med them to
eliminate doubles. The Log-file, at one point, had the same md5sum as
one of the pictures.

Bis denn

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-20 19:21       ` Matthias Schniedermeyer
@ 2004-01-21 11:46         ` Bart Samwel
  2004-01-22  0:12         ` Pavel Machek
  2004-01-22  2:36         ` Jamie Lokier
  2 siblings, 0 replies; 11+ messages in thread
From: Bart Samwel @ 2004-01-21 11:46 UTC (permalink / raw)
  To: Matthias Schniedermeyer
  Cc: Timothy Miller, Valdis.Kletnieks, Linux Kernel Mailing List

Matthias Schniedermeyer wrote:
> On Sat, Jan 17, 2004 at 02:15:31PM +0100, Bart Samwel wrote:
[...]
>>Let's take a look at the chances. 30 terabytes is, in a best-case scenario 
>>(with 512-byte blocks) about 6e10 blocks. That would be roughly 
>>6e10*6e10*(2**(-128)), or about 1e-17. With a hundred million machines, the 
>>chances of a collision would be about 1e-9, disregarding the fact that all 
>>these machines have a large chance of containing similar blocks -- their data 
>>isn't truly random, so some blocks have a larger chance of occurring than 
>>others. The data sets on the machines are probably reasonably static, so if 
>>the collision isn't found *at once* the chances of it occurring later are 
>>much smaller. So, even under the most positive assumptions, with a hundred 
>>million machines with 30 terabytes of storage each, it's extremely probable 
>>that you won't find a collision. (A 96-bit hash could have been broken with 
>>this setup however. :) )
> 
> There is one fundemental braino in the discussion.
> 
> Only HALF the bits are for preventing "accidental" collisions. (The
> "birthday" thing). The rest is for preventing to "brute force" an input
> that produces the same MD5.(*)

 From RFC-1321:

"It is conjectured that the difficulty of coming up with two messages
having the same message digest is on the order of 2^64 operations,
and that the difficulty of coming up with any message having a given
message digest is on the order of 2^128 operations."

So, they say it's on the order of 2^64 operations, whatever their 
definition of "operation" is. It probably means that they think you need 
to take the MD5 of something in the order of 2^64 random messages in 
order to have a reasonable chance of finding a duplicate. The RFC says 
nothing about half of the bits being for one purpose and the other for 
another; in the algorithm all bits seem to be processed in a similar way.

Let's see where they got their 2^64. If you've got 2^64 messages, you've 
got (with the same logic as above) approximately 2^64 * 2^64 * (1 / 
2^128) = 100% chance of a birthday collision. This seems to support the 
idea that the kind of analysis that I just did corresponds to the way 
they look at it.

 > *: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this
 > is also true for MD5.

It might be that you took the "2^64 operations" as meaning "64 bits" (or 
2^80 as 80 bits, in the SHA1 case), while it now seems to be the result 
of a birthday-collision calculation. Not a surprising misinterpretation 
BTW, because the RFC doesn't specify at all where they got this number.

> Btw. I already had (a/the) MD5 collision(*2) in my life.

> *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> eliminate doubles. The Log-file, at one point, had the same md5sum as
> one of the pictures.

Wow! It might be that MD5 is not such a good hash function after all 
then. The security is of course purely based on the hashes of messages 
being very randomly distributed. If they really were, the chances of 
this happening to you would have been extremely slim (less than 1e-20). 
I think the more probable explanation is that MD5 hashes aren't truly 
randomly distributed after all. :)

-- Bart

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-20 19:21       ` Matthias Schniedermeyer
  2004-01-21 11:46         ` Bart Samwel
@ 2004-01-22  0:12         ` Pavel Machek
  2004-01-22  8:29           ` Matthias Schniedermeyer
  2004-01-22  2:36         ` Jamie Lokier
  2 siblings, 1 reply; 11+ messages in thread
From: Pavel Machek @ 2004-01-22  0:12 UTC (permalink / raw)
  To: Matthias Schniedermeyer
  Cc: Bart Samwel, Timothy Miller, Valdis.Kletnieks, Linux Kernel Mailing List

Hi!

> There is one fundemental braino in the discussion.
> 
> Only HALF the bits are for preventing "accidental" collisions. (The
> "birthday" thing). The rest is for preventing to "brute force" an input
> that produces the same MD5.(*)
> 
> So MD5 has only 2**64 Bits against accidental collsions
> Btw. I already had (a/the) MD5 collision(*2) in my life.
> 
> So you'd need SHA256 or SHA512 to be "really sure(tm)".
> 
> 
> 
> *: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this is
> also true for MD5.
> 
> *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> eliminate doubles. The Log-file, at one point, had the same md5sum as
> one of the pictures.

Do you have a copy? I believe *many* people would like to see that
one.
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-20 19:21       ` Matthias Schniedermeyer
  2004-01-21 11:46         ` Bart Samwel
  2004-01-22  0:12         ` Pavel Machek
@ 2004-01-22  2:36         ` Jamie Lokier
  2004-01-22  8:51           ` Matthias Schniedermeyer
  2 siblings, 1 reply; 11+ messages in thread
From: Jamie Lokier @ 2004-01-22  2:36 UTC (permalink / raw)
  To: Matthias Schniedermeyer
  Cc: Bart Samwel, Timothy Miller, Valdis.Kletnieks, Linux Kernel Mailing List

Matthias Schniedermeyer wrote:
> *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> eliminate doubles. The Log-file, at one point, had the same md5sum as
> one of the pictures.

Something similar happened to me once.  Two different files with the
same result from md5sum.

When I ran md5sum again, it still reported the same results.

Then when I flushed the page cache and ran it again, it reported
different results.

I concluded it was a rare page cache corruption heisenbug.  Scary.

-- Jamie

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-22  0:12         ` Pavel Machek
@ 2004-01-22  8:29           ` Matthias Schniedermeyer
  0 siblings, 0 replies; 11+ messages in thread
From: Matthias Schniedermeyer @ 2004-01-22  8:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Bart Samwel, Timothy Miller, Valdis.Kletnieks, Linux Kernel Mailing List

On Thu, Jan 22, 2004 at 01:12:59AM +0100, Pavel Machek wrote:
> Hi!
> 
> > There is one fundemental braino in the discussion.
> > 
> > Only HALF the bits are for preventing "accidental" collisions. (The
> > "birthday" thing). The rest is for preventing to "brute force" an input
> > that produces the same MD5.(*)
> > 
> > So MD5 has only 2**64 Bits against accidental collsions
> > Btw. I already had (a/the) MD5 collision(*2) in my life.
> > 
> > So you'd need SHA256 or SHA512 to be "really sure(tm)".
> > 
> > 
> > 
> > *: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this is
> > also true for MD5.
> > 
> > *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> > eliminate doubles. The Log-file, at one point, had the same md5sum as
> > one of the pictures.
> 
> Do you have a copy? I believe *many* people would like to see that
> one.

Unfortunatly not, and reconstruction is impossibel(tm). "Back then(more
than half a year ago)" i didn't see that as important.




Bis denn

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
  2004-01-22  2:36         ` Jamie Lokier
@ 2004-01-22  8:51           ` Matthias Schniedermeyer
  0 siblings, 0 replies; 11+ messages in thread
From: Matthias Schniedermeyer @ 2004-01-22  8:51 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Bart Samwel, Timothy Miller, Valdis.Kletnieks, Linux Kernel Mailing List

On Thu, Jan 22, 2004 at 02:36:09AM +0000, Jamie Lokier wrote:
> Matthias Schniedermeyer wrote:
> > *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> > eliminate doubles. The Log-file, at one point, had the same md5sum as
> > one of the pictures.
> 
> Something similar happened to me once.  Two different files with the
> same result from md5sum.
> 
> When I ran md5sum again, it still reported the same results.
> 
> Then when I flushed the page cache and ran it again, it reported
> different results.
> 
> I concluded it was a rare page cache corruption heisenbug.  Scary.

I can 100% exclude Linux-Errors. The machine (still) runs with Solaris 8.



Bis denn

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
@ 2004-01-20 18:06 Clayton Weaver
  0 siblings, 0 replies; 11+ messages in thread
From: Clayton Weaver @ 2004-01-20 18:06 UTC (permalink / raw)
  To: linux-kernel

(re: md5 weakness)

The only document I've seen with a
rigorous demonstration of the
possibility of an md5 collision
created it by adding 0 (zero) bytes
to an input (so the colliding inputs
were not the same size in bytes).

Good luck finding a collision with
blocks that are all the same size.

Anyway, hash matching algorithms for
variable sized inputs (hashed extents,
etc) can probably get an additional several
orders of magnitude of safety by using
two hashes (md5 and sha1, for example).

What are the chances that the same two
different inputs that hash to the same
value using one of them collides in the
other, too? ("Left as an exercise for the ...")

Regards,

Clayton Weaver
<mailto: cgweav@email.com>

-- 
___________________________________________________________
Sign-up for Ads Free at Mail.com
http://promo.mail.com/adsfreejump.htm


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2004-01-22  8:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-01-16 20:22 [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL Timothy Miller
2004-01-16 20:37 ` Valdis.Kletnieks
2004-01-16 20:59   ` Timothy Miller
2004-01-17 13:15     ` Bart Samwel
2004-01-20 19:21       ` Matthias Schniedermeyer
2004-01-21 11:46         ` Bart Samwel
2004-01-22  0:12         ` Pavel Machek
2004-01-22  8:29           ` Matthias Schniedermeyer
2004-01-22  2:36         ` Jamie Lokier
2004-01-22  8:51           ` Matthias Schniedermeyer
2004-01-20 18:06 Clayton Weaver

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.