linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Lockless file reading
@ 2003-08-27 12:37 Timo Sirainen
  2003-08-27 12:42 ` Martin Konold
  2003-08-28 10:54 ` Robin Rosenberg
  0 siblings, 2 replies; 56+ messages in thread
From: Timo Sirainen @ 2003-08-27 12:37 UTC (permalink / raw)
  To: linux-kernel

(Maybe a bit off topic, but I couldn't get answers elsewhere and it is
about kernel behaviour)

The question is what can happen if I read() a file that's being
simultaneously updated by a write() in another process? If I'm writing
123 over XXX, is it possible that read() returns 1X3 in some conditions?
Is the behaviour filesystem specific? Any idea about other operating
systems?

I'm thinking about implementing lockless file reads that work like:

void write_data(int fd, off_t offset, void *data, size_t size) {
  lock_file(fd);
  pwrite(fd, data, size, offset); // or writev() or copy+single pwrite()
  pwrite(fd, data, size, offset + size);
  unlock_file(fd);
}

void read_data(int fd, off_t offset, void *data, size_t size) {
  unsigned char buf{size*2];
  for (;;) {
    pread(fd, buf, size*2, offset); // or shared mmap()ed access
    if (memcmp(buf, buf+size, size) == 0) break;
    usleep(10);
  }
  memcpy(data, buf, size);
}

The only case when I see that it would break is if we write "1212" but
pread() sees "1X1X" or "X2X2" there.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 12:37 Lockless file reading Timo Sirainen
@ 2003-08-27 12:42 ` Martin Konold
  2003-08-27 12:52   ` Timo Sirainen
  2003-08-28 10:08   ` Matthias Andree
  2003-08-28 10:54 ` Robin Rosenberg
  1 sibling, 2 replies; 56+ messages in thread
From: Martin Konold @ 2003-08-27 12:42 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: linux-kernel

Am Wednesday 27 August 2003 02:37 pm schrieb Timo Sirainen:

Hi,

> The question is what can happen if I read() a file that's being
> simultaneously updated by a write() in another process?

The behaviour is undefined.
 
> 123 over XXX, is it possible that read() returns 1X3 in some conditions?

Yes. The actual order stuff is written to the disk is not guaranteed.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: martin.konold@erfrakon.de


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 12:42 ` Martin Konold
@ 2003-08-27 12:52   ` Timo Sirainen
  2003-08-27 13:40     ` Richard B. Johnson
  2003-08-28 10:08   ` Matthias Andree
  1 sibling, 1 reply; 56+ messages in thread
From: Timo Sirainen @ 2003-08-27 12:52 UTC (permalink / raw)
  To: Martin Konold; +Cc: linux-kernel

On Wed, 2003-08-27 at 15:42, Martin Konold wrote:
> > The question is what can happen if I read() a file that's being
> > simultaneously updated by a write() in another process?
> 
> The behaviour is undefined.

So it's not such a good idea then? Hmh.. That'd solve a lot of problems
for me.
 
> > 123 over XXX, is it possible that read() returns 1X3 in some conditions?
> 
> Yes. The actual order stuff is written to the disk is not guaranteed.

It doesn't matter when it's actually written to disk, if it's only seen
by read().



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 12:52   ` Timo Sirainen
@ 2003-08-27 13:40     ` Richard B. Johnson
  2003-08-27 14:56       ` Timo Sirainen
  2003-08-28  0:03       ` Jamie Lokier
  0 siblings, 2 replies; 56+ messages in thread
From: Richard B. Johnson @ 2003-08-27 13:40 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: Martin Konold, linux-kernel

On Wed, 27 Aug 2003, Timo Sirainen wrote:

> On Wed, 2003-08-27 at 15:42, Martin Konold wrote:
> > > The question is what can happen if I read() a file that's being
> > > simultaneously updated by a write() in another process?
> >
> > The behaviour is undefined.
>
> So it's not such a good idea then? Hmh.. That'd solve a lot of problems
> for me.
>
> > > 123 over XXX, is it possible that read() returns 1X3 in some conditions?
> >
> > Yes. The actual order stuff is written to the disk is not guaranteed.
>
> It doesn't matter when it's actually written to disk, if it's only seen
> by read().
>


> 123 over XXX, is it possible that read() returns 1X3 in some conditions?

I'm going to take this question literly. You are asking
if the middle-byte of a 3-byte sequence can be residual,
not yet updated.

Let's see if it is possible for the middle byte of
a 3-byte sequence to not be written when both
other bytes are written:

In the first place, everything within a buffer or
sector will be written in-order, i.e., if you have
the two end bytes, you must have the middle byte.
It's only the end-bytes that can stop anywhere.

So we look at the end of a buffer condition.

             |End of buffer
XXXXXXXXXXXXXXXXXXXXXXXXXXX <-- original data
             123            <-- new data
              |____ Write a byte
              ||___ Write a word
              ||||_ Write a long

Clearly, regardless of how the bytes are written, if you
get a '3', but not a two, the next write must have started
at offset 1, not offset 0. So, whatever write sequence
is done internally must subsequently seek backwards to offset
zero. This is highly unlikely (although possible).

Even in machines that do load/store operations where individual
components of those stores happen in groups, access to/from
a buffer of such data is controlled (by hardware) so a write
will complete before a read occurs. So if a data element that
is at a higher address has been modified as well as a data element
at a lower address, either the hardware is capable of byte access
or the center element must also have been modified. With hardware
that can perform byte-access (ix86), the only byte-access that
is going to happen is at the end(s) of buffers.

Given the 'famous' byte-sequence 0x12345678, the following incomplete
updates are possible (big endian):

0x123456xx   Byte access
0x1234xxxx   Word access
0xxx345678   Byte access
0xxxxx5678   Word access

Given the 'famous' byte-sequence 0x12345678, the following incomplete
updates are possible (little endian):

0x875634xx	Byte access
0x8756xxxx	Word access
0xxx563412	Byte access
0xxxxx3412	Word access

So I don't see how you could ever have a sequence of 123 written,
with both '1' and '3' written, but not '2'. It is only the stuff
on the 'ends' that can be incomplete.

Anyway, if you want two (or more) processes to access the file,
you should mmap it. You can configure a mmap'ed file so that
updates appear to all readers. However, just like any shared-memory
access, you need some kind of synchronization, perhaps a semaphore,
so that you always read valid data. Usually one only needs
__valid__ data, not necessarily __current__ data. For instance,
I have one task writing incremental numbers to a specific offset
in a file (shared memory). If it's okay for the reader to get
a non-garbage number, even though it's not the latest instantaneous
number, then if you stay away from multi-part numbers (like long long),
your data will always be "good" with no locking at all. That's because
the write will complete before the CPU gets taken away and given to a
reader. What is not guaranteed is that the data are current. You need a
semaphore for that.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (794.73 BogoMips).
            Note 96.31% of all statistics are fiction.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 13:40     ` Richard B. Johnson
@ 2003-08-27 14:56       ` Timo Sirainen
  2003-08-27 23:39         ` Jamie Lokier
  2003-08-28  0:03       ` Jamie Lokier
  1 sibling, 1 reply; 56+ messages in thread
From: Timo Sirainen @ 2003-08-27 14:56 UTC (permalink / raw)
  To: root; +Cc: Martin Konold, linux-kernel

On Wednesday, Aug 27, 2003, at 16:40 Europe/Helsinki, Richard B. 
Johnson wrote:

> So I don't see how you could ever have a sequence of 123 written,
> with both '1' and '3' written, but not '2'. It is only the stuff
> on the 'ends' that can be incomplete.

That's pretty much what I was assuming without knowing how the kernel 
internally really works.

> Anyway, if you want two (or more) processes to access the file,
> you should mmap it. You can configure a mmap'ed file so that
> updates appear to all readers. However, just like any shared-memory
> access, you need some kind of synchronization, perhaps a semaphore,
> so that you always read valid data. Usually one only needs
> __valid__ data, not necessarily __current__ data.

Right, that's why I don't really need read locking. The 
double-writing/reading with memcmp() checking was supposed to check 
that the data is valid.

I'm already using shared mmaps, but I was thinking that supporting NFS 
would be nice as well. That'd work pretty much the same as write()s.

Maybe it would be possible to use some kind of error detection 
checksums which would guarantee that the data either is valid or isn't, 
regardless of the order in which it is written. I don't really know how 
that could be done though.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  0:52           ` Timo Sirainen
@ 2003-08-27 18:42             ` Nagendra Singh Tomar
  2003-08-28  8:40               ` Timo Sirainen
  2003-08-28  8:42               ` Valdis.Kletnieks
  2003-08-28  1:50             ` Jamie Lokier
  1 sibling, 2 replies; 56+ messages in thread
From: Nagendra Singh Tomar @ 2003-08-27 18:42 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: Jamie Lokier, root, Martin Konold, linux-kernel

Hi,
    I beleive ur original post was to address the case of a reader reading 
a file getting *incorrect* data due to the file being written 
simultaneously by another writer process.
Why do u require file locking if there is a *single* writer ?? I don't 
understand why a 123 written over XXX can result in 1X3. The kernel should 
take care of this. When the writer process is writing 123 it will first be 
written to the page cache. The page cache lock will be taken inside the 
kernel before writing to it, so we know that writing 123 over XXX will be 
atomic.  Now even when this page is flushed to disk, the page lock would 
be taken. So I cannot see a possibility of 123 written over XXX being read 
as 1X3.
 File locking is reqd to synchronize between *multiple* writers as 
their writes can get interspersed. process A writing XXXX... and process B 
writing YYYY.. can result in XXXYYYXY (such a fine intersperse is though 
unlikely). 
I am not clear whether the proposed read_data/write_data/verify are in the 
user 
level or inside the kernel, but I assume that they are in the user level 
called by the writer and reader processes.
Write re-ordering can take place at user level (one writer processes 
calling write() before other process) and/or inside the kernel (based on 
the order in which the page lock was grabbed), while the checksum trick 
can (to some extent) address user space re-ordering, it cannot address 
kernel level reordering.



Thanx
tomar




 On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Thu, 2003-08-28 at 02:39, Jamie Lokier wrote:
> > Timo Sirainen wrote:
> > > Maybe it would be possible to use some kind of error detection 
> > > checksums which would guarantee that the data either is valid or
> isn't, 
> > > regardless of the order in which it is written. I don't really know
> how 
> > > that could be done though.
> > 
> > You can use a strong checksum like MD5, if that is really faster than
> > locking.  (Over NFS it probably is faster than fcntl()-locking, for
> > small data blocks).
> 
> While MD5 is probably good enough, it doesn't _guarantee_ the
> consistency. I just thought of a simple algorithm that I think would.
> I'll go and use that unless someone proves it wrong :)
> 
> Except of course if there's 256 writes and the last one is non-ordered
> and it all happens while a read() is executing.. Less than unlikely I'd
> say.
> 
> I don't think there's any other potential problems with read() than that
> it may not have all data up to date? Such as it would never temporarily
> show the data as zero?
> 
> static int verify(const unsigned char *buf, size_t size)
> {
> 	const unsigned char *checksum = buf + size;
> 	unsigned char xor;
> 	size_t i;
> 
> 	xor = buf[0] ^ checksum[0];
> 	for (i = 1; i < size; i++) {
> 		if ((buf[i] ^ xor) != checksum[i])
> 			return 0;
> 	}
> 	return 1;
> }
> 
> void write_data(const void *data, size_t size)
> {
> 	unsigned char *checksum = buf + size;
> 	unsigned char xor;
> 	size_t i;
> 
> 	memcpy(buf, data, size);
> 
> 	checksum[0]++;
> 	xor = buf[0] ^ checksum[0];
> 
> 	for (i = 1; i < size; i++)
> 		checksum[i] = buf[i] ^ xor;
> }
> 
> void read_data(void *data, size_t size)
> {
> 	unsigned char copy[size*2];
> 
> 	do {
> 		memcpy(copy, buf, size*2);
> 	} while (!verify(copy, size));
> 	memcpy(data, copy, size);
> }
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  8:40               ` Timo Sirainen
@ 2003-08-27 21:15                 ` Nagendra Singh Tomar
  2003-08-28  9:35                   ` Timo Sirainen
  2003-08-28  9:17                 ` David Schwartz
  1 sibling, 1 reply; 56+ messages in thread
From: Nagendra Singh Tomar @ 2003-08-27 21:15 UTC (permalink / raw)
  To: Timo Sirainen
  Cc: Tomar, Nagendra, Jamie Lokier, root, Martin Konold, linux-kernel


On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Wed, 2003-08-27 at 21:42, Nagendra Singh Tomar wrote:
> > Hi,
> >     I beleive ur original post was to address the case of a reader
> reading 
> > a file getting *incorrect* data due to the file being written 
> > simultaneously by another writer process.
> 
> Well, "old" data, which mixed with new data would become incorrect as a
> whole.

What is this mixing we are talking of ??

> 
> > Why do u require file locking if there is a *single* writer ?? I don't
> 
> > understand why a 123 written over XXX can result in 1X3. The kernel
> should 
> > take care of this. When the writer process is writing 123 it will
> first be 
> > written to the page cache. The page cache lock will be taken inside
> the 
> > kernel before writing to it, so we know that writing 123 over XXX will
> be 
> > atomic.  Now even when this page is flushed to disk, the page lock
> would 
> > be taken. So I cannot see a possibility of 123 written over XXX being
> read 
> > as 1X3.
> 
> That was my original plan, to just rely on such kernel behaviour. I just
> don't know if it's such a good idea to rely on that, especially if I
> want to keep my program portable. I'll probably fallback to that anyway
> if my checksumming ideas won't work.

But I don't see any problem with a single writer and >=1 reader. There is 
no question of portability.

tomar
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  9:35                   ` Timo Sirainen
@ 2003-08-27 21:52                     ` Nagendra Singh Tomar
  2003-08-28 13:26                       ` Matthias Andree
  0 siblings, 1 reply; 56+ messages in thread
From: Nagendra Singh Tomar @ 2003-08-27 21:52 UTC (permalink / raw)
  To: Timo Sirainen
  Cc: Tomar, Nagendra, Jamie Lokier, root, Martin Konold, linux-kernel


On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Thu, 2003-08-28 at 00:15, Nagendra Singh Tomar wrote:
> > > >     I beleive ur original post was to address the case of a reader
> > > reading 
> > > > a file getting *incorrect* data due to the file being written 
> > > > simultaneously by another writer process.
> > > 
> > > Well, "old" data, which mixed with new data would become incorrect
> as a
> > > whole.
> > 
> > What is this mixing we are talking of ??
> 
> I can easily see that the data might be divided into two separate pages
> and between updating those pages, another process would have read the
> first page, but not the second page. That would result in a mixed old
> and new data. Probably just <old><new> or <new><old> instead of
> <new><old><new> what I was worrying about, but still it would be nicer
> to rely only on byte atomicity and read/write ordering. :)

Timo, I am at loss. OK, let me put across my understanding of the problem.
Thw writer is *updating* the file and not writing it for the first time,
which means that the file can be 1000 bytes long and the writer might be 
updating bytes 100 onwards. You are worried abt the reader reading 
partially old and partially new data. A question. 
If the writer does not want the reader to read old data why does'nt it 
truncate the file and start writing from the begining every time.
Please correct me if I am missing something significant. All this while I 
was thinking that the writer is writing to the file for the first time. 
i.e as he keeps writing the file size increases, so the reader is sure not 
to read anything else but what the writer writes.

> 
> > > That was my original plan, to just rely on such kernel behaviour. I
> just
> > > don't know if it's such a good idea to rely on that, especially if I
> > > want to keep my program portable. I'll probably fallback to that
> anyway
> > > if my checksumming ideas won't work.
> > 
> > But I don't see any problem with a single writer and >=1 reader. There
> is 
> > no question of portability.
> 
> There are of multiple writers, but it's fine for them to do locking
> between each others.
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28 10:26                       ` Timo Sirainen
@ 2003-08-27 22:58                         ` Nagendra Singh Tomar
  2003-08-28 12:18                           ` Jamie Lokier
  2003-08-28 12:01                         ` Jamie Lokier
  2003-08-28 20:24                         ` David Schwartz
  2 siblings, 1 reply; 56+ messages in thread
From: Nagendra Singh Tomar @ 2003-08-27 22:58 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: David Schwartz, Jamie Lokier, linux-kernel


On Thu, 28 Aug 2003, Timo Sirainen wrote:

> On Thu, 2003-08-28 at 12:56, David Schwartz wrote:
> > > > You said that MD5 wasn't strong enough, and you would like a
> guarantee.
> > 
> > > Yes. I don't really like it if my program heavily relies on
> something
> > > that can go wrong in some situations.
> > 
> > 	Okay, this is too much. Your alternative, assuming the kernel
> won't
> > re-order writes, is clearly relying on something that can go wrong.
> 
> Reorder on per-byte basis? Per-page/block would still be acceptable.
> 
> Anyway, the alternative would be shared mmap()ed file. You can trust
> 32bit memory updates to be atomic, right?
> 
> Or what about: write("12"), fsync(), write("12")? Is it still possible
> for read() to return "1x1x"?

No. Not possible. fsync() doesn't really make a difference here. Just that 
the first "12" gets its way into the disk. Th read will still read both 
"12"s from the cache (in very high possibility). Even if the page caches 
are stolen (due to mem shortage) the kernel file cachecing will ensure 
that you get consistent data.

> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 14:56       ` Timo Sirainen
@ 2003-08-27 23:39         ` Jamie Lokier
  2003-08-28  0:52           ` Timo Sirainen
  0 siblings, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-27 23:39 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: root, Martin Konold, linux-kernel

Timo Sirainen wrote:
> Maybe it would be possible to use some kind of error detection 
> checksums which would guarantee that the data either is valid or isn't, 
> regardless of the order in which it is written. I don't really know how 
> that could be done though.

You can use a strong checksum like MD5, if that is really faster than
locking.  (Over NFS it probably is faster than fcntl()-locking, for
small data blocks).

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 13:40     ` Richard B. Johnson
  2003-08-27 14:56       ` Timo Sirainen
@ 2003-08-28  0:03       ` Jamie Lokier
  2003-08-28 12:08         ` Richard B. Johnson
  1 sibling, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28  0:03 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Timo Sirainen, Martin Konold, linux-kernel

Richard B. Johnson wrote:
> Let's see if it is possible for the middle byte of
> a 3-byte sequence to not be written when both
> other bytes are written:

> Even in machines that do load/store operations where individual
> components of those stores happen in groups, access to/from
> a buffer of such data is controlled (by hardware) so a write
> will complete before a read occurs.

I don't understand what you mean by this.

Do you mean that the writes are forced to appear on a different CPU in
the same order that they were written?  That isn't true on x86, for
two reasons: 1. writes aren't always in processor order (see
CONFIG_X86_OOSTORE and CONFIG_X86_PPRO_FENCE); 2. reads on the other
processor are out of order anyway.

> With hardware that can perform byte-access (ix86), the only
> byte-access that is going to happen is at the end(s) of buffers.

Not true.  Take a look at __copy_user() in arch/i386/lib/usercopy.c.
The first few bytes are copied using "rep;movsb", which is not
guaranteed to use a word write for the aligned pair, nor is it
guaranteed a particular timing (there could be an interrupt between
each byte).

Other architectures are similar.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 12:18                           ` Jamie Lokier
@ 2003-08-28  0:39                             ` Nagendra Singh Tomar
  2003-08-28 13:00                               ` Jamie Lokier
  2003-08-28 21:49                             ` Bernd Eckenfels
  1 sibling, 1 reply; 56+ messages in thread
From: Nagendra Singh Tomar @ 2003-08-28  0:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Tomar, Nagendra, Timo Sirainen, David Schwartz, linux-kernel


On Thu, 28 Aug 2003, Jamie Lokier wrote:

> Nagendra Singh Tomar wrote:
> > > Or what about: write("12"), fsync(), write("12")? Is it still
> possible
> > > for read() to return "1x1x"?
> >
> > Th read will still read both "12"s from the cache (in very high
> possibility).
> 
> That is a terrible assumption.
> 
> On all the Linux architures, even single CPUs, with no fancy memory
> ordering, even 386, Pentiums etc., it is possible for read() to return
> "1x1x".
> 
> Like this: the write() is preempted (see CONFIG_PREEMPT) after writing
> one "1" byte, the read() runs, reads "1x" and is then preempted back
> to the writer, which writes "2", calls fsync(), writes "1" and is then
> preempted back to the reader, which continues and reads its second
> "1x".

kernel mode premption in Linux has still not sunk within me. All this 
while I was assuming that the kernel cannot be preempted. But I have one 
question:
While the write had "12" in its buffers and it  would have grabbed the 
page lock to write it into the page cache, won't it set some flag saying 
that I don't want to be prempted now. I think there is a small primitive 
for it in from 2.5 onwards. I don't think it will be a good idea to prempt 
while it is holding the page lock. How is it possible that it just wrote 
"1" and did not write "2" though it had grabbed the page lock for that 
purpose. 
I am yet to read the 2.5 kernel premption code.
Thanx for ur nice explaination though.

> 
> Interrupts caused by network packets arriving at just the wrong moment
> can trigger that sort of thing.
> 
> > Even if the page caches are stolen (due to mem shortage) the kernel
> > file cachecing will ensure that you get consistent data.
> 
> The kernel does not provide synchronisation between read() and write()
> data transfers, and it does not always use atomic 32-bit reads and
> writes either.


> 
> -- Jamie
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 23:39         ` Jamie Lokier
@ 2003-08-28  0:52           ` Timo Sirainen
  2003-08-27 18:42             ` Nagendra Singh Tomar
  2003-08-28  1:50             ` Jamie Lokier
  0 siblings, 2 replies; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28  0:52 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: root, Martin Konold, linux-kernel

On Thu, 2003-08-28 at 02:39, Jamie Lokier wrote:
> Timo Sirainen wrote:
> > Maybe it would be possible to use some kind of error detection 
> > checksums which would guarantee that the data either is valid or isn't, 
> > regardless of the order in which it is written. I don't really know how 
> > that could be done though.
> 
> You can use a strong checksum like MD5, if that is really faster than
> locking.  (Over NFS it probably is faster than fcntl()-locking, for
> small data blocks).

While MD5 is probably good enough, it doesn't _guarantee_ the
consistency. I just thought of a simple algorithm that I think would.
I'll go and use that unless someone proves it wrong :)

Except of course if there's 256 writes and the last one is non-ordered
and it all happens while a read() is executing.. Less than unlikely I'd
say.

I don't think there's any other potential problems with read() than that
it may not have all data up to date? Such as it would never temporarily
show the data as zero?

static int verify(const unsigned char *buf, size_t size)
{
	const unsigned char *checksum = buf + size;
	unsigned char xor;
	size_t i;

	xor = buf[0] ^ checksum[0];
	for (i = 1; i < size; i++) {
		if ((buf[i] ^ xor) != checksum[i])
			return 0;
	}
	return 1;
}

void write_data(const void *data, size_t size)
{
	unsigned char *checksum = buf + size;
	unsigned char xor;
	size_t i;

	memcpy(buf, data, size);

	checksum[0]++;
	xor = buf[0] ^ checksum[0];

	for (i = 1; i < size; i++)
		checksum[i] = buf[i] ^ xor;
}

void read_data(void *data, size_t size)
{
	unsigned char copy[size*2];

	do {
		memcpy(copy, buf, size*2);
	} while (!verify(copy, size));
	memcpy(data, copy, size);
}



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 13:00                               ` Jamie Lokier
@ 2003-08-28  1:06                                 ` Nagendra Singh Tomar
  0 siblings, 0 replies; 56+ messages in thread
From: Nagendra Singh Tomar @ 2003-08-28  1:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Tomar, Nagendra, Timo Sirainen, David Schwartz, linux-kernel


On Thu, 28 Aug 2003, Jamie Lokier wrote:

> Nagendra Singh Tomar wrote:
> > While the write had "12" in its buffers and it  would have grabbed the
> 
> > page lock to write it into the page cache, won't it set some flag
> saying 
> > that I don't want to be prempted now. I think there is a small
> primitive 
> > for it in from 2.5 onwards. I don't think it will be a good idea to
> prempt 
> > while it is holding the page lock. How is it possible that it just
> wrote 
> > "1" and did not write "2" though it had grabbed the page lock for that
> 
> > purpose. 
> 
> Nope.  I don't see any disabling of preemption while the page is held.
> 
> It wouldn't make sense anyway, because the copies to/from userspace
> can sleep, so there's nothing to gain by disabling preemption.

I understand  what you say. Thanx.

> 
> -- Jamie
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  0:52           ` Timo Sirainen
  2003-08-27 18:42             ` Nagendra Singh Tomar
@ 2003-08-28  1:50             ` Jamie Lokier
  2003-08-28  3:17               ` Timo Sirainen
  1 sibling, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28  1:50 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: root, Martin Konold, linux-kernel

Timo Sirainen wrote:
> 	checksum[0]++;
> 	xor = buf[0] ^ checksum[0];

Your algorithm isn't going to work if the new value of xor is the same
as the old value of xor.

> 	do {
> 		memcpy(copy, buf, size*2);
> 	} while (!verify(copy, size));
> 	memcpy(data, copy, size);

It isn't safe because the memcpy() can read writes done on another
processor out of order, and xor does not always change.

You can read some of the newly written bytes in both the buf[] and
checksum[] halves of the buffer, while reading some of the previous
bytes in each half.  If the set of new bytes in the first half matches
the set in the second half well enough (i.e. the two sets match for
bytes which aredifferent between the old and new data blocks), and the
xor values are the same between the old and new data blocks, then your
test will accept a mix of old and new data bytes.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  1:50             ` Jamie Lokier
@ 2003-08-28  3:17               ` Timo Sirainen
  2003-08-28  6:01                 ` Valdis.Kletnieks
                                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28  3:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: root, Martin Konold, linux-kernel

On Thursday, Aug 28, 2003, at 04:50 Europe/Helsinki, Jamie Lokier wrote:

>> 	checksum[0]++;
>> 	xor = buf[0] ^ checksum[0];
>
> Your algorithm isn't going to work if the new value of xor is the same
> as the old value of xor.

I was trying to prevent it with the checksum[0]++ .. but yes, you're 
right.

I'm sure someone has figured out a way to make a checksum of data that 
can detect if there's even a single bit wrong, if the checksum is 
allowed to take as much space as the data itself. I should read more 
about algorithms..

How about checksum[n] = data[n-1] ^ data[n]? That looks like it would 
work.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  3:17               ` Timo Sirainen
@ 2003-08-28  6:01                 ` Valdis.Kletnieks
  2003-08-28  6:13                 ` Jamie Lokier
  2003-08-28  9:13                 ` Martin Konold
  2 siblings, 0 replies; 56+ messages in thread
From: Valdis.Kletnieks @ 2003-08-28  6:01 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1179 bytes --]

On Thu, 28 Aug 2003 06:17:46 +0300, Timo Sirainen said:

> How about checksum[n] = data[n-1] ^ data[n]? That looks like it would 
> work.

Unfortunately, if you swap the contents of 'n' and 'n-1', the ^ remains the same,
as it's a commutative operation.

If you want to go down this route, you *really* want to use at least a CRC
here, and understand *why* the incremental computation of the IP header
checksum works at all (hint - the fact it's so simple puts a significant upper
bound on the error-detection strength of the checksum...)

1071 Computing the Internet checksum. R.T. Braden, D.A. Borman, C.
     Partridge. Sep-01-1988. (Format: TXT=54941 bytes) (Updated by
     RFC1141) (Status: UNKNOWN)

1141 Incremental updating of the Internet checksum. T. Mallory, A.
     Kullberg. Jan-01-1990. (Format: TXT=3587 bytes) (Updates RFC1071)
     (Updated by RFC1624) (Status: INFORMATIONAL)

1624 Computation of the Internet Checksum via Incremental Update. A.
     Rijsinghani, Ed.. May 1994. (Format: TXT=9836 bytes) (Updates
     RFC1141) (Status: INFORMATIONAL)

http://www.ietf.org/rfc/rfc1071.txt
http://www.ietf.org/rfc/rfc1141.txt
http://www.ietf.org/rfc/rfc1624.txt

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  3:17               ` Timo Sirainen
  2003-08-28  6:01                 ` Valdis.Kletnieks
@ 2003-08-28  6:13                 ` Jamie Lokier
  2003-08-28  8:57                   ` Timo Sirainen
  2003-08-28  9:13                 ` Martin Konold
  2 siblings, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28  6:13 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: root, Martin Konold, linux-kernel

Timo Sirainen wrote:
> I'm sure someone has figured out a way to make a checksum of data that 
> can detect if there's even a single bit wrong, if the checksum is 
> allowed to take as much space as the data itself. I should read more 
> about algorithms..

You said that MD5 wasn't strong enough, and you would like a guarantee.

You won't find a guarantee unless you are prepared to use memory
barriers in your code.  _Any_ checksum is going to have a chance of
false validation if you are doing out-of-order reads which can observe
parts of the old and new data, and parts of the old and new checksum.

> How about checksum[n] = data[n-1] ^ data[n]? That looks like it would 
> work.

Consider when the data {1,0,1,0} is replaced by {2,0,2,0}.

-- Jamie


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 18:42             ` Nagendra Singh Tomar
@ 2003-08-28  8:40               ` Timo Sirainen
  2003-08-27 21:15                 ` Nagendra Singh Tomar
  2003-08-28  9:17                 ` David Schwartz
  2003-08-28  8:42               ` Valdis.Kletnieks
  1 sibling, 2 replies; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28  8:40 UTC (permalink / raw)
  To: nagendra_tomar; +Cc: Jamie Lokier, root, Martin Konold, linux-kernel

On Wed, 2003-08-27 at 21:42, Nagendra Singh Tomar wrote:
> Hi,
>     I beleive ur original post was to address the case of a reader reading 
> a file getting *incorrect* data due to the file being written 
> simultaneously by another writer process.

Well, "old" data, which mixed with new data would become incorrect as a
whole.

> Why do u require file locking if there is a *single* writer ?? I don't 
> understand why a 123 written over XXX can result in 1X3. The kernel should 
> take care of this. When the writer process is writing 123 it will first be 
> written to the page cache. The page cache lock will be taken inside the 
> kernel before writing to it, so we know that writing 123 over XXX will be 
> atomic.  Now even when this page is flushed to disk, the page lock would 
> be taken. So I cannot see a possibility of 123 written over XXX being read 
> as 1X3.

That was my original plan, to just rely on such kernel behaviour. I just
don't know if it's such a good idea to rely on that, especially if I
want to keep my program portable. I'll probably fallback to that anyway
if my checksumming ideas won't work.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 18:42             ` Nagendra Singh Tomar
  2003-08-28  8:40               ` Timo Sirainen
@ 2003-08-28  8:42               ` Valdis.Kletnieks
  2003-08-28 20:13                 ` David B. Stevens
  1 sibling, 1 reply; 56+ messages in thread
From: Valdis.Kletnieks @ 2003-08-28  8:42 UTC (permalink / raw)
  To: nagendra_tomar
  Cc: Timo Sirainen, Jamie Lokier, root, Martin Konold, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2720 bytes --]

On Thu, 28 Aug 2003 00:12:30 +0530, Nagendra Singh Tomar said:

> Why do u require file locking if there is a *single* writer ?? I don't 
> understand why a 123 written over XXX can result in 1X3.

You'd be *amazed* at what can go strange (not wrong, just strange) at the
hardware level.  Let's assume for the sake of argument that  in your '123' and
'XXX', each character represents 4 or 8 or however many bytes wide your memory
bus is (so "123" is really a 24 byte string).  The following can happen
(especially if your hardware doesn't have cache coherency):

1) CPU 0 stores the 24 bytes, and it splits across a cache line boundary.
The '12' goes in one line, '3' in the next.

2) Cache controller 0 does the writeback of '3' first.

3) Cache controller 0 starts the writeback of the other cache line,
and gets the '1' written, still waiting for next memory cycle to write '2'.

4) CPU 1 or a DMA device snarfs up the 24 bytes before the '2' gets there.
'3' got there, '1' got there, '2' will get there the *NEXT* bus cycle.

1X3.  Whoops. 

Real hardware does this sort of thing all the time. Consider *this* gem from
the IBM S/370 Principles of Operation (GA22-7000-10, page 5-12):

"For the instructions EDIT, EDIT AND MARK, and TRANSLATE, the portions of the
operands that are actually used in the operation may be established in a trial
execution for operand accessibility that is performed before the execution of
the instruction is started.  This trial execution consists in an execution of
the instruction in which results are not stored.  If the first operand of
TRANSLATE or either operand of EDIT or EDIT AND MARK is changed by another CPU
or by a channel, after the initial trial execution but before completion of
execution, the contents of any fields due to be changed by the instruction are
unpredictable. Furthermore, it is unpredictable whether or not an interruption
occurs for an access exception that was not initially applicable".

In Linux terms, the S/370 family has a hardware instruction that does a
somewhat limited 'sprintf'.  It may make a dummy pass through the format string
to compute how long the result is (so it can tell if you should get a SEGV for
going off the end of a page) and it's possible for the format string to get
changed out from under it while it's doing that.  In addition, if the format
*used* to be a %4d that wrote to the last 4 bytes of a page, and somebody nails
it to be a %8d halfway through, we may or may not get a SEGV when we scribble 4
bytes onto the next page in memory, even if it's a readonly page we scribbled
on....

The description of what order things are seen to happen in runs for 11 pages,
*not* including special-cases like the above....







[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  6:13                 ` Jamie Lokier
@ 2003-08-28  8:57                   ` Timo Sirainen
  2003-08-28  9:56                     ` David Schwartz
  0 siblings, 1 reply; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28  8:57 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: root, Martin Konold, linux-kernel

On Thu, 2003-08-28 at 09:13, Jamie Lokier wrote:
> Timo Sirainen wrote:
> > I'm sure someone has figured out a way to make a checksum of data that 
> > can detect if there's even a single bit wrong, if the checksum is 
> > allowed to take as much space as the data itself. I should read more 
> > about algorithms..
> 
> You said that MD5 wasn't strong enough, and you would like a guarantee.

Yes. I don't really like it if my program heavily relies on something
that can go wrong in some situations.

> You won't find a guarantee unless you are prepared to use memory
> barriers in your code.  _Any_ checksum is going to have a chance of
> false validation if you are doing out-of-order reads which can observe
> parts of the old and new data, and parts of the old and new checksum.

Not really. With {b1, b2, b1 xor b2} it doesn't matter what you read or
write first, no matter what the old data was. If they match, the result
is always either old or new.

If I want to get a checksum of 4 bytes then, I have to divide them into
two parts. Using the b1^b2 I can know if either one of them is valid,
but I can't know if they belong together.

Assuming that it's always either the previous one or the new one, I
think (once again :) that it's possible to check that by getting mixed
checksums: data = ABCD, c1 = A^B, c2 = C^D, c3 = A^C, c4 = B^D.

Except that the old data that read() sees could be even older than the
previous value. Maybe here works the growing xor-byte. I haven't thought
that far yet.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  3:17               ` Timo Sirainen
  2003-08-28  6:01                 ` Valdis.Kletnieks
  2003-08-28  6:13                 ` Jamie Lokier
@ 2003-08-28  9:13                 ` Martin Konold
  2003-08-28  9:27                   ` Timo Sirainen
  2 siblings, 1 reply; 56+ messages in thread
From: Martin Konold @ 2003-08-28  9:13 UTC (permalink / raw)
  To: Timo Sirainen, Jamie Lokier; +Cc: root, linux-kernel

Am Thursday 28 August 2003 05:17 am schrieb Timo Sirainen:

Hi Timo,

> How about checksum[n] = data[n-1] ^ data[n]? That looks like it would

I propose you first make some benchmarks and try to figure out how big the 
actual overhead of locking really is. I can easily assume that your 
"solution" is actually slower than a simple mechanism/semaphore. 

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: martin.konold@erfrakon.de


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28  8:40               ` Timo Sirainen
  2003-08-27 21:15                 ` Nagendra Singh Tomar
@ 2003-08-28  9:17                 ` David Schwartz
  1 sibling, 0 replies; 56+ messages in thread
From: David Schwartz @ 2003-08-28  9:17 UTC (permalink / raw)
  To: Timo Sirainen, nagendra_tomar
  Cc: Jamie Lokier, root, Martin Konold, linux-kernel


> That was my original plan, to just rely on such kernel behaviour. I just
> don't know if it's such a good idea to rely on that, especially if I
> want to keep my program portable. I'll probably fallback to that anyway
> if my checksumming ideas won't work.

	If you only have one writer, why not have the writer update an MD5 checksum
in the file along with the datawrite? If the reader sees an invalid
checksum, it repeats the read. This is simple, elegant, and foolproof. The
only possible flaw would be if you found two data sets with the same MD5
checksum. The instant fame would be well worth the inconvenience. ;)

	DS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  9:13                 ` Martin Konold
@ 2003-08-28  9:27                   ` Timo Sirainen
  2003-08-28  9:48                     ` Martin Konold
  0 siblings, 1 reply; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28  9:27 UTC (permalink / raw)
  To: Martin Konold; +Cc: Jamie Lokier, root, linux-kernel

On Thu, 2003-08-28 at 12:13, Martin Konold wrote:
> > How about checksum[n] = data[n-1] ^ data[n]? That looks like it would
> 
> I propose you first make some benchmarks and try to figure out how big the 
> actual overhead of locking really is. I can easily assume that your 
> "solution" is actually slower than a simple mechanism/semaphore. 

It's not about CPU usage. It's mostly about being able to modify the
file even when there's thousands of simultaneous readers that could
otherwise keep the file locked almost constantly.

Also it'd be nice to support NFS with .lock files since no-one really
uses lockd.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 21:15                 ` Nagendra Singh Tomar
@ 2003-08-28  9:35                   ` Timo Sirainen
  2003-08-27 21:52                     ` Nagendra Singh Tomar
  0 siblings, 1 reply; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28  9:35 UTC (permalink / raw)
  To: nagendra_tomar; +Cc: Jamie Lokier, root, Martin Konold, linux-kernel

On Thu, 2003-08-28 at 00:15, Nagendra Singh Tomar wrote:
> > >     I beleive ur original post was to address the case of a reader
> > reading 
> > > a file getting *incorrect* data due to the file being written 
> > > simultaneously by another writer process.
> > 
> > Well, "old" data, which mixed with new data would become incorrect as a
> > whole.
> 
> What is this mixing we are talking of ??

I can easily see that the data might be divided into two separate pages
and between updating those pages, another process would have read the
first page, but not the second page. That would result in a mixed old
and new data. Probably just <old><new> or <new><old> instead of
<new><old><new> what I was worrying about, but still it would be nicer
to rely only on byte atomicity and read/write ordering. :)

> > That was my original plan, to just rely on such kernel behaviour. I just
> > don't know if it's such a good idea to rely on that, especially if I
> > want to keep my program portable. I'll probably fallback to that anyway
> > if my checksumming ideas won't work.
> 
> But I don't see any problem with a single writer and >=1 reader. There is 
> no question of portability.

There are of multiple writers, but it's fine for them to do locking
between each others.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  9:27                   ` Timo Sirainen
@ 2003-08-28  9:48                     ` Martin Konold
  0 siblings, 0 replies; 56+ messages in thread
From: Martin Konold @ 2003-08-28  9:48 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: linux-kernel

Am Thursday 28 August 2003 11:27 am schrieben Sie:

Hi,

> It's not about CPU usage. It's mostly about being able to modify the
> file even when there's thousands of simultaneous readers that could
> otherwise keep the file locked almost constantly.

No, the readers only have to check if the file got locked by the writer(s).
  
> Also it'd be nice to support NFS with .lock files since no-one really
> uses lockd.

IMHO a lock file on nfs is the most inefficient locking mechanism in case of 
readers and writer(s) on the same machine.

BTW: What about moving this thread away from linux-kernel? It is already clear 
that the kernel does not provide these guarantees you asked for.

Regards,
-- martin

Dipl.-Phys. Martin Konold
e r f r a k o n
Erlewein, Frank, Konold & Partner - Beratende Ingenieure und Physiker
Nobelstrasse 15, 70569 Stuttgart, Germany
fon: 0711 67400963, fax: 0711 67400959
email: martin.konold@erfrakon.de


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28  8:57                   ` Timo Sirainen
@ 2003-08-28  9:56                     ` David Schwartz
  2003-08-28 10:26                       ` Timo Sirainen
  2003-08-28 12:44                       ` Ragnar Hojland Espinosa
  0 siblings, 2 replies; 56+ messages in thread
From: David Schwartz @ 2003-08-28  9:56 UTC (permalink / raw)
  To: Timo Sirainen, Jamie Lokier; +Cc: linux-kernel


> On Thu, 2003-08-28 at 09:13, Jamie Lokier wrote:

> > Timo Sirainen wrote:
> > > I'm sure someone has figured out a way to make a checksum of
> > > data that
> > > can detect if there's even a single bit wrong, if the checksum is
> > > allowed to take as much space as the data itself. I should read more
> > > about algorithms..
> >
> > You said that MD5 wasn't strong enough, and you would like a guarantee.

> Yes. I don't really like it if my program heavily relies on something
> that can go wrong in some situations.

	Okay, this is too much. Your alternative, assuming the kernel won't
re-order writes, is clearly relying on something that can go wrong. MD5
would be orders of magnitude more reliable.

	No two data sets with the same MD5 hash are known. It will be many, many
years before anyone finds two data sets of the same size with the same MD5
hash. The odds of having two data sets just happen to have the same MD5 has
are  infinitesimal.

	If you compared the MD5 hashes of random data sets 1,000,000 times a second
for a million years, your odds of getting a single false match on any one of
those comparisons are less than one in a quadrillion.

	DS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 12:42 ` Martin Konold
  2003-08-27 12:52   ` Timo Sirainen
@ 2003-08-28 10:08   ` Matthias Andree
  1 sibling, 0 replies; 56+ messages in thread
From: Matthias Andree @ 2003-08-28 10:08 UTC (permalink / raw)
  To: linux-kernel

On Wed, 27 Aug 2003, Martin Konold wrote:

> Am Wednesday 27 August 2003 02:37 pm schrieb Timo Sirainen:
> 
> Hi,
> 
> > The question is what can happen if I read() a file that's being
> > simultaneously updated by a write() in another process?
> 
> The behaviour is undefined.
>  
> > 123 over XXX, is it possible that read() returns 1X3 in some conditions?
> 
> Yes. The actual order stuff is written to the disk is not guaranteed.

What has the disk block write order got to do with it? It's not as
though write() would be cached but read() see the raw disk.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28  9:56                     ` David Schwartz
@ 2003-08-28 10:26                       ` Timo Sirainen
  2003-08-27 22:58                         ` Nagendra Singh Tomar
                                           ` (2 more replies)
  2003-08-28 12:44                       ` Ragnar Hojland Espinosa
  1 sibling, 3 replies; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28 10:26 UTC (permalink / raw)
  To: David Schwartz; +Cc: Jamie Lokier, linux-kernel

On Thu, 2003-08-28 at 12:56, David Schwartz wrote:
> > > You said that MD5 wasn't strong enough, and you would like a guarantee.
> 
> > Yes. I don't really like it if my program heavily relies on something
> > that can go wrong in some situations.
> 
> 	Okay, this is too much. Your alternative, assuming the kernel won't
> re-order writes, is clearly relying on something that can go wrong.

Reorder on per-byte basis? Per-page/block would still be acceptable.

Anyway, the alternative would be shared mmap()ed file. You can trust
32bit memory updates to be atomic, right?

Or what about: write("12"), fsync(), write("12")? Is it still possible
for read() to return "1x1x"?



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 12:37 Lockless file reading Timo Sirainen
  2003-08-27 12:42 ` Martin Konold
@ 2003-08-28 10:54 ` Robin Rosenberg
  2003-08-28 12:42   ` Jamie Lokier
  1 sibling, 1 reply; 56+ messages in thread
From: Robin Rosenberg @ 2003-08-28 10:54 UTC (permalink / raw)
  To: linux-kernel

Aren't Linux files also streams. Writing to a stream in sequence
should update the stream in sequence, or it wouldn't be a stream,
would it?

-- robin

onsdagen den 27 augusti 2003 14.37 skrev Timo Sirainen:
> (Maybe a bit off topic, but I couldn't get answers elsewhere and it is
> about kernel behaviour)
>
> The question is what can happen if I read() a file that's being
> simultaneously updated by a write() in another process? If I'm writing
> 123 over XXX, is it possible that read() returns 1X3 in some conditions?
> Is the behaviour filesystem specific? Any idea about other operating
> systems?
>
> I'm thinking about implementing lockless file reads that work like:
>
> void write_data(int fd, off_t offset, void *data, size_t size) {
>   lock_file(fd);
>   pwrite(fd, data, size, offset); // or writev() or copy+single pwrite()
>   pwrite(fd, data, size, offset + size);
>   unlock_file(fd);
> }
>
> void read_data(int fd, off_t offset, void *data, size_t size) {
>   unsigned char buf{size*2];
>   for (;;) {
>     pread(fd, buf, size*2, offset); // or shared mmap()ed access
>     if (memcmp(buf, buf+size, size) == 0) break;
>     usleep(10);
>   }
>   memcpy(data, buf, size);
> }
>
> The only case when I see that it would break is if we write "1212" but
> pread() sees "1X1X" or "X2X2" there.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 10:26                       ` Timo Sirainen
  2003-08-27 22:58                         ` Nagendra Singh Tomar
@ 2003-08-28 12:01                         ` Jamie Lokier
  2003-08-28 13:28                           ` Timo Sirainen
  2003-08-28 20:24                         ` David Schwartz
  2 siblings, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 12:01 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: David Schwartz, linux-kernel

Timo Sirainen wrote:
> Reorder on per-byte basis? Per-page/block would still be acceptable.

The _CPUs_ can reorder on a per-byte basis, on a multiprocessor.  It
has nothing to do with the kernel.

> Anyway, the alternative would be shared mmap()ed file. You can trust
> 32bit memory updates to be atomic, right?

Atomic yes (if aligned), weakly ordered though.

> Or what about: write("12"), fsync(), write("12")? Is it still possible
> for read() to return "1x1x"?

Yes it is possible, in principle.

This is what happens: the writing CPU writes "1", "2" in order.  The
reading CPU reads bytes 1, 3 before the writes are
observed and bytes 0, 2 after.  The CPU can do this.

The kernel doesn't prevent this, because it doesn't hold any exclusive
lock between the writer and reader during the data transfers.
Furthermore the kernel transfers a byte at a time, on some
architecture (including x86), if any buffer is not aligned.

It is very unlikely to return "1x1x", but you should know it is not
architecturally impossible.  Given your incomplete knowledge of every
architectural quirk, it is more likely to occur than an MD5 collision.

On 32-bit aligned atomicity: if the block of 4 bytes is aligned in
memory, then with shared mmap you will only see whole words
transferred because all (current) Linux SMP-capable architectures
offer 32 bit atomicity.  It is not a very nice assumption: it doesn't
hold for 16 bits or 64 bits, and may not hold for a future 64 bit
architecture.  Keep in mind the words stay whole, but multiple words
are read out of order.

With read() and write(), even aligned 32-bit words don't work.  On
some 64-bit architectures, a 32-bit word read() will be issued as 4
byte reads at the machine level, with weak memory ordering.  (I'm
reading Alpha and IA64 __copy_user right now, and they both do that).

Enjoy :)
-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  0:03       ` Jamie Lokier
@ 2003-08-28 12:08         ` Richard B. Johnson
  2003-08-28 12:39           ` Jamie Lokier
  0 siblings, 1 reply; 56+ messages in thread
From: Richard B. Johnson @ 2003-08-28 12:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Timo Sirainen, Martin Konold, linux-kernel

On Thu, 28 Aug 2003, Jamie Lokier wrote:

> Richard B. Johnson wrote:
> > Let's see if it is possible for the middle byte of
> > a 3-byte sequence to not be written when both
> > other bytes are written:
>
> > Even in machines that do load/store operations where individual
> > components of those stores happen in groups, access to/from
> > a buffer of such data is controlled (by hardware) so a write
> > will complete before a read occurs.
>
> I don't understand what you mean by this.
>
> Do you mean that the writes are forced to appear on a different CPU in
> the same order that they were written?  That isn't true on x86, for
> two reasons: 1. writes aren't always in processor order (see
> CONFIG_X86_OOSTORE and CONFIG_X86_PPRO_FENCE); 2. reads on the other
> processor are out of order anyway.
>

I never said, nor even implied any such thing.

> > With hardware that can perform byte-access (ix86), the only
> > byte-access that is going to happen is at the end(s) of buffers.
>
> Not true.  Take a look at __copy_user() in arch/i386/lib/usercopy.c.
> The first few bytes are copied using "rep;movsb", which is not
> guaranteed to use a word write for the aligned pair, nor is it
> guaranteed a particular timing (there could be an interrupt between
> each byte).
>

Of course it's true. The implimentation detail of trimming the
start or finish of a copy procedure trims the ends. That's,
in fact, why it is impossible, yes, __impossible__ for the byte-sequence
0xABC to have both 0xA and 0xC copied without the 0xB being copied also.
There are no combinations of byte/word/long writes that will allow
this to happen.

It doesn't make any difference about the endian either as I
have previously shown.

Whether or not there's an interrupt between each byte means nothing
also. In the byte copy described, an interruopt at any time will
simply leave:

XXX
XXC
XBC
ABC

Clearly, if both A and C were copied, B was copied also.
You can screw around with endian and use words and longwords
as well. It just isn't possible for, in any 3-byte sequence
for the middle byte to be missing.


If 0x0A, 0x,0B, 0x0c, exist in memory as a word (longword),
with Intel, the lowest byte is in the lowest memory location.
Therefore, we have a word of 0xXCBA or 0xCBAX, depending upon
whether the high nibble or the low nibble becomes part of the
short. Notice that the 'B' is always between 'A' and 'C'?
Therefore if A and C got copied, B must have also been copied.

> Other architectures are similar.
>
> -- Jamie
>

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (794.73 BogoMips).
            Note 96.31% of all statistics are fiction.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 22:58                         ` Nagendra Singh Tomar
@ 2003-08-28 12:18                           ` Jamie Lokier
  2003-08-28  0:39                             ` Nagendra Singh Tomar
  2003-08-28 21:49                             ` Bernd Eckenfels
  0 siblings, 2 replies; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 12:18 UTC (permalink / raw)
  To: Nagendra Singh Tomar; +Cc: Timo Sirainen, David Schwartz, linux-kernel

Nagendra Singh Tomar wrote:
> > Or what about: write("12"), fsync(), write("12")? Is it still possible
> > for read() to return "1x1x"?
>
> Th read will still read both "12"s from the cache (in very high possibility).

That is a terrible assumption.

On all the Linux architures, even single CPUs, with no fancy memory
ordering, even 386, Pentiums etc., it is possible for read() to return
"1x1x".

Like this: the write() is preempted (see CONFIG_PREEMPT) after writing
one "1" byte, the read() runs, reads "1x" and is then preempted back
to the writer, which writes "2", calls fsync(), writes "1" and is then
preempted back to the reader, which continues and reads its second
"1x".

Interrupts caused by network packets arriving at just the wrong moment
can trigger that sort of thing.

> Even if the page caches are stolen (due to mem shortage) the kernel
> file cachecing will ensure that you get consistent data.

The kernel does not provide synchronisation between read() and write()
data transfers, and it does not always use atomic 32-bit reads and
writes either.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 12:08         ` Richard B. Johnson
@ 2003-08-28 12:39           ` Jamie Lokier
  0 siblings, 0 replies; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 12:39 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Timo Sirainen, Martin Konold, linux-kernel

Richard B. Johnson wrote:
> > > Even in machines that do load/store operations where individual
> > > components of those stores happen in groups, access to/from
> > > a buffer of such data is controlled (by hardware) so a write
> > > will complete before a read occurs.
> >
> > I don't understand what you mean by this.
> >
> > Do you mean that the writes are forced to appear on a different CPU in
> > the same order that they were written?  That isn't true on x86, for
> > two reasons: 1. writes aren't always in processor order (see
> > CONFIG_X86_OOSTORE and CONFIG_X86_PPRO_FENCE); 2. reads on the other
> > processor are out of order anyway.
> 
> I never said, nor even implied any such thing.

That's why I said I don't understand what you meant.  What does your
first paragraph above mean?

> [...] The implimentation detail of trimming the start or finish of a
> copy procedure trims the ends. That's, in fact, why it is
> impossible, yes, __impossible__ for the byte-sequence 0xABC to have
> both 0xA and 0xC copied without the 0xB being copied also.  There
> are no combinations of byte/word/long writes that will allow this to
> happen.

Correct.

> Whether or not there's an interrupt between each byte means nothing
> also. In the byte copy described, an interruopt at any time will
> simply leave:
> 
> XXX
> XXC
> XBC
> ABC
> 
> Clearly, if both A and C were copied, B was copied also.
> You can screw around with endian and use words and longwords
> as well. It just isn't possible for, in any 3-byte sequence
> for the middle byte to be missing.

What you say is correct about the intermediate states from
the writer's point of view, however a parallel reader _can_ observe
the middle byte missing.

The writing proceeds: A, B then C.  In that order.  No problem.

The writing CPU or task will always see these memory states:

   XXX
   AXX
   ABX
   ABC

Two problems for the reader, though: multiprocessor out of order
reads, and uniprocessor preemption.

Out of order means the reader can read the second byte _before_ the
first byte.  Which means the reader sees this:

  ...
  .X.
  AX.
  AXB

Preemption means this sequence of events can occur:

  Writer's                              Reader's
  destination buffer                    destination buffer
  --------------------------------------------------------

  XXX
	writer writes A
  AXX
	context switch to reader
	reader reads 2 bytes
					-> AX.
	context switch to writer
	writer writes B, C
  ABX
  ABC
	context switch to reader
	reader reads final byte
					-> AXB

The preemption case is a classic race between a writer and reader
moving in the same direction, alternating who is ahead.

Parallel memory access is not as simple as you make it out to be.  If
you don't understand this, you will have a hard time understanding locks,
set_task_state vs. __set_task_state, the wmb/rmb/smp_rmb() macros, etc.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 10:54 ` Robin Rosenberg
@ 2003-08-28 12:42   ` Jamie Lokier
  0 siblings, 0 replies; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 12:42 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

Robin Rosenberg wrote:
> Aren't Linux files also streams.

No, they aren't.  Stream sockets, pipes and FIFOs are.  Files aren't.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  9:56                     ` David Schwartz
  2003-08-28 10:26                       ` Timo Sirainen
@ 2003-08-28 12:44                       ` Ragnar Hojland Espinosa
  2003-08-28 13:03                         ` Jamie Lokier
  2003-08-28 20:37                         ` Lockless file reading David Schwartz
  1 sibling, 2 replies; 56+ messages in thread
From: Ragnar Hojland Espinosa @ 2003-08-28 12:44 UTC (permalink / raw)
  To: David Schwartz; +Cc: Timo Sirainen, Jamie Lokier, linux-kernel

On Thu, Aug 28, 2003 at 02:56:29AM -0700, David Schwartz wrote:
> 	No two data sets with the same MD5 hash are known. It will be many, many
> years before anyone finds two data sets of the same size with the same MD5
> hash. The odds of having two data sets just happen to have the same MD5 has
> are  infinitesimal.

It can happen.  It happened to me with two gifs.  FWIW.
-- 
Ragnar Hojland - Project Manager
Linalco "Specialists in Linux and Free Software"
http://www.linalco.com Tel: +34-91-5970074 Fax: +34-91-5970083

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  0:39                             ` Nagendra Singh Tomar
@ 2003-08-28 13:00                               ` Jamie Lokier
  2003-08-28  1:06                                 ` Nagendra Singh Tomar
  0 siblings, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 13:00 UTC (permalink / raw)
  To: Nagendra Singh Tomar; +Cc: Timo Sirainen, David Schwartz, linux-kernel

Nagendra Singh Tomar wrote:
> While the write had "12" in its buffers and it  would have grabbed the 
> page lock to write it into the page cache, won't it set some flag saying 
> that I don't want to be prempted now. I think there is a small primitive 
> for it in from 2.5 onwards. I don't think it will be a good idea to prempt 
> while it is holding the page lock. How is it possible that it just wrote 
> "1" and did not write "2" though it had grabbed the page lock for that 
> purpose. 

Nope.  I don't see any disabling of preemption while the page is held.

It wouldn't make sense anyway, because the copies to/from userspace
can sleep, so there's nothing to gain by disabling preemption.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 12:44                       ` Ragnar Hojland Espinosa
@ 2003-08-28 13:03                         ` Jamie Lokier
  2003-08-28 17:26                           ` root
  2003-08-28 20:37                         ` Lockless file reading David Schwartz
  1 sibling, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 13:03 UTC (permalink / raw)
  To: Ragnar Hojland Espinosa; +Cc: David Schwartz, Timo Sirainen, linux-kernel

Ragnar Hojland Espinosa wrote:
> It can happen.  It happened to me with two gifs.  FWIW.

Probability on the order of 2^-32 with MD5 any-pairs collision.
(It's not usual to have so many GIFs to compare, though :)
SHA is better, and both probably have some weakness that increases the
probability of collision.

Do you still have the GIFs?

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-27 21:52                     ` Nagendra Singh Tomar
@ 2003-08-28 13:26                       ` Matthias Andree
  0 siblings, 0 replies; 56+ messages in thread
From: Matthias Andree @ 2003-08-28 13:26 UTC (permalink / raw)
  To: linux-kernel

On Thu, 28 Aug 2003, Nagendra Singh Tomar wrote:

> If the writer does not want the reader to read old data why does'nt it 
> truncate the file and start writing from the begining every time.

To give you some background: Timo wants to store cache data in these
lockless. When these caches are truncated, they can be dropped
altogether, but the application's performance will drop drastically.

-- 
Matthias Andree
Encrypted mail solicited                Verschlüsselte Mail erwünscht
http://blackhole.pca.dfn.de:11371/pks/lookup?op=get&search=0x052E7D95
-> http://www.gnupg.org/(en)/related_software/frontends.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 12:01                         ` Jamie Lokier
@ 2003-08-28 13:28                           ` Timo Sirainen
  0 siblings, 0 replies; 56+ messages in thread
From: Timo Sirainen @ 2003-08-28 13:28 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Thursday, Aug 28, 2003, at 15:01 Europe/Helsinki, Jamie Lokier wrote:

> This is what happens: the writing CPU writes "1", "2" in order.  The
> reading CPU reads bytes 1, 3 before the writes are
> observed and bytes 0, 2 after.  The CPU can do this.
>
> The kernel doesn't prevent this, because it doesn't hold any exclusive
> lock between the writer and reader during the data transfers.
> Furthermore the kernel transfers a byte at a time, on some
> architecture (including x86), if any buffer is not aligned.

Thanks, this was the kind of information I wanted to know. I assumed 
there were exclusive writer locks and such.

I think I've finally decided what to do. Mostly due to realization that 
the most problematic part was actually not a problem at all since it 
required locking anyway. :) Here's the summary and then I'll shut up 
here:

This was about index files for mailboxes. I had already written support 
for maildir, but mbox support was where I got stuck. I needed four kind 
of file updates:

1) File offset pointers (linked list). These are set only once, never 
modified. Originally I used lowest bit to specify that the offset was 
"used". I first wrote them without the bit, fsync, then the bit. 
Apparently this isn't enough then, so I'll change all 4 bytes to 
contain the used-bit. This gives 28bit number, but I can deal with 8 
byte alignments so maximum file size is 2GB which is enough.

2) mbox message offsets. These change when messages are deleted or 
sometimes with flag changes. But the mbox file has to be locked for 
reading anyway if I want to use these offsets, so there's no way 
someone is updating them at the same time.

3) Index summary. Total number of message count, unread message count, 
etc. There's not many fields here, so I could try different kinds of 
ideas here.. One that at least works is to keep them in separate file 
and update it with rename().

4) Flag fields. There aren't mutually exclusive flag changes, so it 
doesn't matter if I read some of the changes partially.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 13:03                         ` Jamie Lokier
@ 2003-08-28 17:26                           ` root
  2003-08-28 17:35                             ` Jamie Lokier
  2003-08-28 21:59                             ` Bernd Eckenfels
  0 siblings, 2 replies; 56+ messages in thread
From: root @ 2003-08-28 17:26 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ragnar Hojland Espinosa, David Schwartz, Timo Sirainen, linux-kernel

> 
> Ragnar Hojland Espinosa wrote:
> > It can happen.  It happened to me with two gifs.  FWIW.
> 
> Probability on the order of 2^-32 with MD5 any-pairs collision.
> (It's not usual to have so many GIFs to compare, though :)
> SHA is better, and both probably have some weakness that increases the
> probability of collision.
> 
> Do you still have the GIFs?

MD5 is 128 bit output, so that's around 2^64 pairs before you have a birthday.
There arn't that many GIFs in the world.
I'd be really surprised if there were that many pictures in the world.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 17:26                           ` root
@ 2003-08-28 17:35                             ` Jamie Lokier
  2003-08-28 18:10                               ` root
  2003-08-28 21:59                             ` Bernd Eckenfels
  1 sibling, 1 reply; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 17:35 UTC (permalink / raw)
  To: root; +Cc: Ragnar Hojland Espinosa, David Schwartz, Timo Sirainen, linux-kernel

root@mauve.demon.co.uk wrote:
> > Probability on the order of 2^-32 with MD5 any-pairs collision.
> MD5 is 128 bit output, so that's around 2^64 pairs before you have a birthday.

Right.  Dozy me :)

> > Do you still have the GIFs?
> 
> There arn't that many GIFs in the world.
> I'd be really surprised if there were that many pictures in the world.

I'd be really surprised if what you saw wasn't a software error,
misreporting or miscalculating the MD5.

-- JAmie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 17:35                             ` Jamie Lokier
@ 2003-08-28 18:10                               ` root
  0 siblings, 0 replies; 56+ messages in thread
From: root @ 2003-08-28 18:10 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: root, Ragnar Hojland Espinosa, David Schwartz, Timo Sirainen,
	linux-kernel

> 
> root@mauve.demon.co.uk wrote:
> > > Probability on the order of 2^-32 with MD5 any-pairs collision.
> > MD5 is 128 bit output, so that's around 2^64 pairs before you have a birthday.
> 
> Right.  Dozy me :)
> 
> > > Do you still have the GIFs?
> > 
> > There arn't that many GIFs in the world.
> > I'd be really surprised if there were that many pictures in the world.
> 
> I'd be really surprised if what you saw wasn't a software error,
> misreporting or miscalculating the MD5.

Or perhaps more likely, truncating the hash to 32 bits, in which case for most
people there won't be a problem, as a collision isn't likely until you get
to tens of thousands of images.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28  8:42               ` Valdis.Kletnieks
@ 2003-08-28 20:13                 ` David B. Stevens
  0 siblings, 0 replies; 56+ messages in thread
From: David B. Stevens @ 2003-08-28 20:13 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: nagendra_tomar, Timo Sirainen, Jamie Lokier, root, Martin Konold,
	linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> On Thu, 28 Aug 2003 00:12:30 +0530, Nagendra Singh Tomar said:
> 
> 
>>Why do u require file locking if there is a *single* writer ?? I don't 
>>understand why a 123 written over XXX can result in 1X3.
> 
> 
> You'd be *amazed* at what can go strange (not wrong, just strange) at the
> hardware level.  Let's assume for the sake of argument that  in your '123' and
> 'XXX', each character represents 4 or 8 or however many bytes wide your memory
> bus is (so "123" is really a 24 byte string).  The following can happen
> (especially if your hardware doesn't have cache coherency):
> 
> 1) CPU 0 stores the 24 bytes, and it splits across a cache line boundary.
> The '12' goes in one line, '3' in the next.
> 
> 2) Cache controller 0 does the writeback of '3' first.
> 
> 3) Cache controller 0 starts the writeback of the other cache line,
> and gets the '1' written, still waiting for next memory cycle to write '2'.
> 
> 4) CPU 1 or a DMA device snarfs up the 24 bytes before the '2' gets there.
> '3' got there, '1' got there, '2' will get there the *NEXT* bus cycle.
> 
> 1X3.  Whoops. 
> 
> Real hardware does this sort of thing all the time. Consider *this* gem from
> the IBM S/370 Principles of Operation (GA22-7000-10, page 5-12):
> 

Snip

 >
 >
>  In addition, if the format
> *used* to be a %4d that wrote to the last 4 bytes of a page, and somebody nails
> it to be a %8d halfway through, we may or may not get a SEGV when we scribble 4
> bytes onto the next page in memory, even if it's a readonly page we scribbled
> on....
> 

Ah, but the scribble will not take place.

> The description of what order things are seen to happen in runs for 11 pages,
> *not* including special-cases like the above....
> 

BTW you are quoting from my all time favorite book. Conceptual order is 
not always the real order and all those mind games.

Cheers,
   Dave


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28 10:26                       ` Timo Sirainen
  2003-08-27 22:58                         ` Nagendra Singh Tomar
  2003-08-28 12:01                         ` Jamie Lokier
@ 2003-08-28 20:24                         ` David Schwartz
  2 siblings, 0 replies; 56+ messages in thread
From: David Schwartz @ 2003-08-28 20:24 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: Jamie Lokier, linux-kernel


> On Thu, 2003-08-28 at 12:56, David Schwartz wrote:

> > > > You said that MD5 wasn't strong enough, and you would like
> > > > a guarantee.

> > > Yes. I don't really like it if my program heavily relies on something
> > > that can go wrong in some situations.

> > 	Okay, this is too much. Your alternative, assuming the kernel won't
> > re-order writes, is clearly relying on something that can go wrong.

> Reorder on per-byte basis? Per-page/block would still be acceptable.

	There is no law that says the kernel can't re-order on a per-byte basis.
Memory visibility and posted writes could, at least in theory, result in
this outcome today. Future technology may make it even more probable.

> Anyway, the alternative would be shared mmap()ed file. You can trust
> 32bit memory updates to be atomic, right?

	Yes, but you can't trust that multiple 32 bit memory updates won't be seen
out of order by another processor without appropriate barriers or
synchronization.

> Or what about: write("12"), fsync(), write("12")? Is it still possible
> for read() to return "1x1x"?

	I believe it still is. The 'fsync' function does not synchronize another
process' view of memory. It is not a memory barrier and, in any event, if it
was a memory barrier it would be running in the wrong place. The 'fsync'
function syncs the memory with stable storage, it does not synch (or at
least, it is not guaranteed to sync) with the memory view of another
process.

	Nothing stops speculative reads for the two halves of the data, each in
their own 32-bit unit, from crossing in time. Even if it's not possible on
today's hardware, it is still possible in principle.

	DS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28 12:44                       ` Ragnar Hojland Espinosa
  2003-08-28 13:03                         ` Jamie Lokier
@ 2003-08-28 20:37                         ` David Schwartz
  2003-08-28 22:11                           ` Bernd Eckenfels
  1 sibling, 1 reply; 56+ messages in thread
From: David Schwartz @ 2003-08-28 20:37 UTC (permalink / raw)
  To: Ragnar Hojland Espinosa; +Cc: linux-kernel


> On Thu, Aug 28, 2003 at 02:56:29AM -0700, David Schwartz wrote:

> > No two data sets with the same MD5 hash are known. It will
> > be many, many
> > years before anyone finds two data sets of the same size with
> > the same MD5
> > hash. The odds of having two data sets just happen to have the
> > same MD5 has
> > are  infinitesimal.

> It can happen.  It happened to me with two gifs.  FWIW.

	Find those GIFs, double-check, and publish immediately. That would be
amazingly big news and would probably cause huge numbers of people to switch
from MD5 to SHA1 overnight.

	Far more likely, you are mistaken.

	DS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 12:18                           ` Jamie Lokier
  2003-08-28  0:39                             ` Nagendra Singh Tomar
@ 2003-08-28 21:49                             ` Bernd Eckenfels
  1 sibling, 0 replies; 56+ messages in thread
From: Bernd Eckenfels @ 2003-08-28 21:49 UTC (permalink / raw)
  To: linux-kernel

In article <20030828121823.GB6800@mail.jlokier.co.uk> you wrote:
> The kernel does not provide synchronisation between read() and write()
> data transfers, and it does not always use atomic 32-bit reads and
> writes either.

There are some gurantees for pipes, for example. Writes which are smaller
than pathconf(_PC_PIPE_BUF, pipe) will be atomic, i.e. not mixed between
multiple writers. I dont know what the gurantee for write/read or mmap is.

Greetings
Bernd
-- 
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 17:26                           ` root
  2003-08-28 17:35                             ` Jamie Lokier
@ 2003-08-28 21:59                             ` Bernd Eckenfels
  2003-08-28 23:02                               ` Jamie Lokier
  2003-08-28 23:44                               ` Lockless file readingu root
  1 sibling, 2 replies; 56+ messages in thread
From: Bernd Eckenfels @ 2003-08-28 21:59 UTC (permalink / raw)
  To: linux-kernel

In article <200308281726.SAA24033@mauve.demon.co.uk> you wrote:
> I'd be really surprised if there were that many pictures in the world.

Well, this is about probabilty. It does not mean that you need 2^64
pictures, neighter does it mean you have a collision within 2^64 pictures.

Greetings
Bernd
-- 
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 20:37                         ` Lockless file reading David Schwartz
@ 2003-08-28 22:11                           ` Bernd Eckenfels
  2003-08-28 23:00                             ` Jamie Lokier
  2003-08-29  0:47                             ` David Schwartz
  0 siblings, 2 replies; 56+ messages in thread
From: Bernd Eckenfels @ 2003-08-28 22:11 UTC (permalink / raw)
  To: linux-kernel

In article <MDEHLPKNGKAHNMBLJOLKEEEEFMAA.davids@webmaster.com> you wrote:
>        Find those GIFs, double-check, and publish immediately. That would be
> amazingly big news and would probably cause huge numbers of people to switch
> from MD5 to SHA1 overnight.

Why is that? For a hash with n bits there are at least
2^y / 2^n = 2^(y-n) files with the same hash, if they have size y bits.
Three are even more, if you consider all files up to this size.

With n=128 (md5) and y=80000 (10k file) x=2^79872 different files with the same
hash, which is:

73758771305612149138757973947840453624822288693642912037251984265061\
00114977473466610342516138996953532522212913684506495987369027376247\
15680102183899956755293229604036298841899346896312975318416816792181\
27163102430377376805525407677117167195273455353500937474682475767850\
59179157052570202917558881297004500267146311552672331119254293941288\
81200564802772472735124212232463022640846529434844027943281412745730\
85139886962923676798297010059108583228148158547397805954834594989809\
72953082523096457840274668781301014306803374729359559243555477575414\
40743184860888676456545815644014580307967389124314445607714035149073\
07589212426013390575273338209003958822345762588652779339518192543309\
86317487722170491968583142374006999722192982951269179790768422998167\
08449702141611939030460451116251870263862094621134152440907522456787\
51931691226465546945055073850028999878309015830845841569473374428818\
38069014874242916295358660843459867579762128717086485255524813917818\
93048437497315173334429979530572030202862199229371143871214566570030\
75727422692477487723119025817627046899875118936499627544458020896601\
57937925093115270533258873516793039117097872784457478242527803568668\
28631394601551771044737759212698144953003036178706026830838663271584\
01592980859584723256093157100606730021922520778222789293516580288692\
25264244660543124963539004344523631091601054705861145611115771926445\
77478534613667305980110146817904878379835989090392217842151012038447\
32903904630045111054945146422201082791203805511249222706103705145946\
71820615624444389237027551184044272914241672227827166453175228687028\
66938683553829374974281116729061361295584136766500184918365115021199\
37169541145849843445053788883266574169813935549175907371700139138138\
72502071050440159721888802465338768753780473818364722509365607509899\
90281003295942592341570547210474186910921077818634263104420109401884\
65729426198803625408484236068350211025109659994524663107915732813683\
23249026269135868577867279765840583230539982435092205081214139636131\
52716158876922016303074856693366702970999646715276255547943011782454\
30451711340888832195663444371564861489518840602769168182051363706921\
56348391007580137339506987160160155329372785718654816692279404095002\
71631290709679864507500640424102364928586816669728574941400097861766\
98623345776769276035921922328812527260139781666739266290331711100138\
69355520506331079686339512263367217460717946825089940625062127323177\
86653058251722203905147108918559635723363913322812633573852636358143\
05929154109676275195321548031664387835763338445788114635589758368895\
17974907569210022960592177775660950422645740978394883220940559863725\
17922274955785951581660617914435597347391402963730377168547070736184\
09409184312862027981853714218127309283218818430216375047782183166306\
16773059670202724953584786650536914280599829953639402884772276266175\
74429349374567546167841892184392743355864837653142407213100256586750\
41336949647306338065403113935569936012036214119515511822780434287340\
30928734800002296824907120604076897989376531011143837044498904591969\
67573317525873614600338869535377596304879781797541287906717071752299\
79432009185387838666393216748127057247821526828251843082459605902721\
91898927046995849949526294260125716903061767423516291316259448887102\
21455117996027263044925703914568589351096240255287529058558303698852\
12851051794252832424631653450392428817783210001109070180912805870873\
07386072771724226933940298630758819641696078651336928520241718916016\
03519077021589676947417102185394484893439475195997406265182485474944\
59178017073695377899466712572589814669494675411536769252318932662247\
42203943726665732107904396523581039784309768624193293669453454613822\
15749223233434417462994719818283525937442572314803099751885933759868\
16634054700507294873752867437781563278323662445301928131033228617590\
35430194103956670055593952506967036312263794347882662751144124966996\
91351677629105493982971530018612054379318441420814383739436560078198\
43138418873116377898447796602758998473352233358353697569513976917684\
45423037648290442116492517078342243822589633273690408706974851274325\
08132319738846084567101570415478070907066126456504167969509604493072\
48645914957093677769177377246870165130337742440748526644770151979128\
47211604618032267059204616591999117507548440215522308874869279619705\
44790645926520397665781107515922450033794810695414009529069301089709\
73157072428268136340204167242984153890552777305481674601484732441017\
28560824403131840588191220521738725204457330809940997294315972715083\
60725305431083136292963174671514146443703870465945021837055000033974\
71343654123104963598692282015717496687304002201274588103720513379013\
73625443109735850169152345779367135337891706905992705940277701946037\
51485390147093579878750275097249158482766343355558089592208214469361\
42276931894022033384129759100364979210001950264216905954876971408322\
93770122779579100592816803465994717468500525380304100505203551456609\
49222119718024454375095356130282351572692787346489597961086730476245\
84334725114957596993473817425633275365545763431282993581470273699983\
69757127063885542311338579547488527586375582908609448860018780419534\
19537297554404836539225811986557499970844823045184713996979552418442\
74406760049159007279015378851550531580664049422653169603166655301827\
48599352193666664508497247573119948382046576805640670494290298734858\
84417223231477087092260452907970029505261005523561715560411919810056\
26325051985436811764833953794260944157856829720560313439000417881996\
66649136236854393658468387755348570975402758189647533662960697427372\
12412647608184960382793382969692900066773207105873069629313639145519\
80713895245632909242920773336712845268455388312926807578615068583681\
14735015718512752431993230816015076649376333137154762997959604419405\
40830019834949523547804904348975842603471995834045489085385950422244\
21421527056856660062948296226796218876244177325383754033674652041604\
73794980904792158435418276687436661921442912299388028773041675317063\
41702985744933602467081363806581646503634957766157218777471663915518\
02627490937845991300518517420026319487215368719584336485771425750960\
74147906392458647321632082098445775029982834370708320917720129979770\
70455506307814353531726531033563236535836642541212334872049648262139\
75871050354308249098506454123665816967377687075573519014546199680718\
25504901192973448419707741143954204819897813037578850381034208321065\
90313529679084091518649506694968971633222734828354332911155417094032\
10171315988545986151874205830881181014135272806355975146559095444333\
46904167367553579042699361907200322278137701375922580419478734423318\
06121488248550399697475750553850800216488249799595528934562185651076\
98483900622844796406822349397868990356392508055418557772495336464051\
84315867243632263128883584334516110866716220279538827000535200550974\
00891217955190870275336026052153821633475850086009198501611505257778\
47933885680755163296870602837070256783151019285961820158384372038595\
17860622613336775309534151796578569381023269452368108582507056951247\
98793404504540912203957292887715691948144120313016926351322062639192\
87532878209614118770056914646891704747300580355118133346576439497968\
43441915977050748964333072943054209320237827891228838972888080894003\
90308897490656128265393212239169918857054561600215188074809617435067\
81432282933335020500424666690346752022172982638444273323437620748833\
52465137449659970847203193277839245531164554835482556691810957680344\
96966689035036985236456367028660016192169034841287219164333818992502\
01221473697763200027508042141909843428691750233206647821600433304062\
18891775169474513534619981893727149433699742028109319843146130300790\
48374620733930112399742313419818337405590406519992557821007628973929\
00113044791623585643793713735052077993606309171905986641896042192953\
02914470680005418423614268207498056056631718569685347881208192391048\
12736436159629914835075343477182008191330928785768860373127939973177\
63052494277204358579240174730488245462349822120037469207659121017510\
01218092291272699027706462817620604198339134745323293916578572282549\
24116244533223866364856865317114967557066663536376343128567680906720\
61764562620886039013573861476592226708854287846592304969334912469803\
12507139147064808903612376733627058308253758999654545653079870971966\
39352239897170601807293538529571930214655824923553210628065441410082\
64909257175626977532187270976620686987192211027534081125067087406140\
04272207570560217022276862167201738705591967047132029817306091706167\
88511161829178775447774522166205835983174468078354646490355262626185\
92707769692650015463072254012707290128020704879925204262894366607928\
64569688539411097568820564002708260545302498646096613541573488053035\
29464112666464799579334001389463815093626821732035568867075693046582\
56066956011728344932544832226503455818172038856422008742254965154356\
66307452484032159959349389526776258180342580543432325682116409135613\
88088767367314844519622100290218802495202321475300477340729565680783\
24011223601204141085691954438204937682332227360382936775988886545128\
93525280821911159972971397574162898085649998565393045404760240604331\
59456659730073515088831333039887309968528845217110079369331693213177\
05558397796379108248708431620233189642397710829943933934005835585902\
17513252794090844735190314941878898989170740854188417518129638056086\
07436762139072615685935122969736651233387075068866259306433410312065\
73504562435401771168805144280321002849860775467143169339505355458866\
35914375988685051115119841307132244418293162454216614670787597317746\
66702911316829023341512709925133948717354523200459998088061675060047\
00566452526710346126651489557268627487651204778699369531585918862201\
59394424745148402742835964929610173703136610804997257202644221090466\
98365297814614744314991928707978437775163172405897520668685112787838\
76745065718227414525009485074009755851181719599276687557225114880046\
03173249464073454779191764341396098872621419423704644007522244100096\
27738372944337782228361617675552100763578393891200653283263519950059\
32714093086973120414639660768933627220032232042053664212501154040067\
41174802564325712314623332216263370677113891935181869723499560650683\
86380815056042861307598414035089442567526704536760404967820726566472\
34670879286692361894969928723752609821775353401149541579651197092514\
39298884447458964811457837922247753902245800878298397829819244600259\
60945304748005444929411694097677373095316194496470938817202711690246\
79164873737588149815425937970573045094656554393364961182798975769812\
79063271606896233814472483059102258115803176451232829617731169473560\
56583472626947527000930777640604734587567158760154364347063261832083\
30385996594589042026405412914351169274159744595075192817784749916345\
62470152751910309001973543268287070146028018506822943624359351696941\
89166494104952290716061467283074936182290043354342490250743196513357\
17212673549917050251415008968860641379683205051131693384563804473690\
89903250787076735030486079965562026803790996240143092356323469108949\
07845535789793689147065546557522839031958055488735322364364131104340\
87166908595780193040642500443843720507588834041902825237277938594405\
15509518379205857942740777802594442587952768392559677020956217589081\
33581899255243337125974255338488475972519745646623282832498001302879\
97096502521398591719712234914109280948285965958478557415471893042412\
37475503214912860261669526560018289994451491775859964901871011582605\
60397222346597960243475095118473975946655966278677300109917282300025\
91196061133760791076031926996915150405486810283942043845416405992441\
04234755865756841005862837001581295460756719973524643902069661443476\
62959323040510506366321536923369435204548433728840037610657060117916\
95218686643134430952095557859827160438074418388249761882170083975363\
49979442098222565355493523403865857324728074954237197296535659279741\
76818946685603045371134466487089332172645850146950210739544416499959\
11699854986965051039658459578956685582448600003884615796462271186448\
17152459906468747763647302588605751100068841061137998158161515420480\
22318750114714645614691409738870229881884135960615681233518827274646\
97655884831524317196513396138604061144709084173127050169960969426206\
52655647298123938967996892262704749484308663243367637233409236395877\
16158838490461567162279329710757545371095127926752344490557461174904\
98350296783037094146290240832642344526951036992128964662104823342328\
54326936859341427510045192855933515626979981579879097216743613480412\
84130111347086326278853018443938104571356527897125855276423431718226\
89948184483964355088191096584486312766049989136785882316252089805747\
91742304600477823361721935335348605132172666045313701386462122287056\
68758394818039394561931249752372359921860235500907703617657471112468\
36477879522500385199108709199684161073878090521604160929658824487302\
08189162141752889696881176598927266776983392249528725466796126147755\
84837319880970278900349346884424939463657023804336647859408884059715\
87012385007902709564185245198547735808691732425755879909338114582324\
96496038851490551258638486060560640779848053453363726807356342381429\
06674320609564946079226292351275912040951590375532791423187152515437\
23852827475892552409562375800889091489173683224781236872060328011056\
48781494027495854853889407259647716580473257070974609155548989153057\
01770396652955542660500535945575049924043963937672274745980419440159\
73418948752580407334724228608155102768772518992593039260939482902676\
68501092440834712572791048398198968952196157492831517190068185690130\
54317743725067865519433837861089656827013210893936524514466555291774\
10118234887021100028017166818373039407738256677074456585169768128673\
31216559786733796846363998574307817534685493308372883174988103338379\
77555499797541822977140710754411379683506825653833835598576632482517\
06897089238260684742063540778170332874744726525838301125226609661605\
75675563264013373771866481097747281315582164490082018932862351602307\
75895597782979091503504290038602647497360256570951098586836413102875\
05985508624511969413092808738775819236922993770908434526013048298335\
63019052110438767283756977888399786317944181262208218736332067033822\
80782661106911795601344763359694750182233827973870280096262437147106\
21345220816897690852045590174605218233496926136141353580378360441714\
28438525233992119663994156239594488313118082243243369528158085152773\
57113378544700236891699113941209103476577008555151150115939550291435\
84311784078501339810706274047644249675445910838212985742593658736726\
05590639634037920190032975704406638099883233645741951743468840995139\
43095512100364094192434566607233069670007691516971518451704593177588\
61282246970125199204660143691272485688286787069963298843415512939777\
64698968031194340606574494573123788121878504563898097114591153445491\
29114045251097051865489399353650136868661016594913875104276448322279\
38508327655772735551923801839886882663244185808346800338063200999436\
16634873166641664790809023885607426914349049558044907612956164113179\
41293084012994951583428416228171274647443699568963003542273548716725\
17543291905739230393954548707021174633513110455440690091273680401298\
12811462097001540223298427500619842282324231271218044540375197467894\
28787621198614457986071126897232495087089840730512905845441161761976\
66601236384032968167285587365409796297465633367726141545996988226883\
94404671657814361092082070588158359501103729598165458354758606730331\
06290923381456909223490297830648362593407596156948828105259159998540\
44681640786318943172502512008752140731673485710162423305221273151554\
14015776177271915335460332199341190875273337813209488648738342833559\
09991792529769222004989099471330675361053443603329769538381697759223\
41487298533047095302388789420693497992006089184803515480781920674919\
09075112364663390621356691339916840245607264564259595373510560189888\
83101152955043503694233791483780253651442819969176278483653883495740\
57348801797457787746560339910935082336178277998340420138860067492161\
80813845520161575668483836859059768645273848264199471732826240207283\
38944009476728508602225488515628992665747699225194127625115010282470\
11255421483010324323934269992453569693406756352625863659617370205694\
46730077567953286193905468530935524416186408220535319273409812154850\
66824640573392916854142528081849040648607925797094385311506922861842\
29144703260429506651941695600111589700081612805239764675960599037895\
85277505278580936871197464961945900796506521050743626513365790652639\
56267995795043841000443373363843188371190067977723895217943323322350\
07131769451445041372122674930206828784929068233225437967059745018449\
24032625385535746044191547305282331530150508122071323064899819715044\
74449341068915282166440199819997121493518122097904841256853371980242\
48686646703791404029568587332628302419928168479892841143897873463368\
91001511064038809491058043208641187299641201395772810244836614652387\
82450681130790331970820232940957236616253594577206591272486713707415\
46948035707641090744852030891472775930637435026651131633879270089970\
44303247964187045852327022149531568482844729534374639100547599878437\
44909519052859764561719309116739679668953213767053177134351858393487\
62085012809637892432061609006747375297605894263669573918014684415911\
77903552657797646888181858087038285788571519109995872900507911395336\
80048417655055459938566880542365695507252330830042280914049463078624\
33569004674540996019078694725685882888530273965971773466249177582671\
42332603208213018856233359235015599937114645865169765977699011678178\
65491574334292488980604052665130357287858832893782241288755784638960\
64397859346154929859372879633696613249742905340795597385119890246889\
08688700640311194450755763678975202118000072107423667977807786843802\
46174591520518486946282593854915117798362423931239232306806408129269\
34708055912784899835082772799444653961132222086104558970432767747437\
88633596584638516429917747859701245255267497057587911651007806899367\
83605388585832329612519332348998714112329223819072111733578428201610\
89626170510883461035298725572775800706131604421628258178566484068261\
12405337928043939969536697946870334885024960249216783531029555972097\
84824503847059549111252017217901866557048983055500910349584743702027\
02270184957931102654283893318223648235659233326009586692028246496989\
60710329892077654990141054083454257677162385124341509709606214608731\
38194953833855587315682647117056971213294828892548510635009164117322\
69780817879546310416971723639232724179897303295326084797910865848021\
71103436912274354278556604836788582218736050493567224034486228435752\
75772545021730137777109249271305862854606379162479556321716369747012\
71058026509724749532545806263542358710308420248483382469723776037376\
69597983214319566329384729274790976182741310687825587124747694271202\
28994119273457226536530124689674046719121694781331975885866944397414\
80646928865900476175693828843742554200741632048619852118186250519076\
96565875277089012694058827876101297079720952161195398523023828552657\
02759085017247705446867703237841506390589295997796632748329114874936\
18403281978323594239953480189136997128462682656651813935559548642366\
91344303935616433794560077739844540045026897632503760550277165908180\
44396002251991670687351585740943555336338404437656999031317379460574\
33644039340780171677598011917530775848363921450425665793612984955814\
46229638868178745565157046098390502162983892266194312085128770307885\
03935195697010019173631562502099772732135285257616457457496713081114\
27698087930936134900937786460750887725990966034866935651522844738350\
67836613842467245269685348426064212529884648042396191418842320783973\
75995845907852205274837621277373785446822230566103387366537283783471\
14733090624177806365035413997882073568160227884766259995743233063926\
04323525627951228768897375175821617018283853002830565486213078784078\
55148910934828741195385502504757742265796083683664577440469374407863\
13164696065151429326584359520661070954626038065790919987475229950740\
98216024048599560662121214665212267608906887699061900342360030929097\
47880445733354436868505653643898815979415719223075312881714224290123\
41103905501102814618080271895011309165724524105825798203841299390522\
77992971203468621022598279406809300700081197021715469489274684846553\
77079370486851280632467450720757578010346887352264309727206549952369\
84430678997862148905029984576703819937255833785743720341765540435032\
96646736240918734453057541699910818180971797288574782433416947290678\
37280856433367250804218572773292211981900395475951298227991332978940\
27393755102812354289041266547584803540291969245091496543899626015439\
20544328434478996976635560702205616427714866475845230636742677201248\
54833564014964014561785203285877295747933756675009893281551897328021\
59418768491468356922541194834374009854217058462894766713723976242506\
33501980447527984054506287725044344865461692582424535946790342675202\
19477924778192351335239273036438409900357736968995545750602630916862\
07145824980908319494765273379305286696173544270167015094012422019461\
10110887380826756316024097822574773598557194695478247372939507966203\
29705858302812528783680908061271781318308853066172524646574009812058\
15491577412033628444516865373536840299843225279750680250512209287726\
70397253246630246078780479868717275510620978044054573646725585953242\
24758324770955832253706182177378996095677809200493210148738421043729\
74274161518014481783974080854045256838062599340760051731497627412507\
47265737662102660719976929447755810323498446434442537511040265468910\
43510141536688577299258638010109977110563868273874430671291803629871\
60935608539042562755537099055435404432514293203203439433284377871436\
57994132148348689166418229564255038075101142940868371859796958043423\
95574667099069464561787497638061471023478899430045148689551367991881\
21566086064763654052446414731052255138606247622016854079994875327395\
48336043017691288701099997564691519979028704748160129306878648076267\
78807483332466167739993336059393439458713795495027130913477137163967\
33212638853761959999943498955186589302026949447117881641750014115955\
94467826370160249803549647393175426177149199867597218424686918372730\
78980944010070138138080752013831258280466907055392611875371510020251\
51800106836557368069728552452932087455607278280099606979325749370959\
32140319187190721617428623235849046666769248377704247592637495100653\
16908459137057163937207803590262164206824274693991636821189334048631\
18357431296069511882893876953799136574175178784539883682427570859050\
23638256014536405998794647819608013589296597158733008320056824712260\
63549827856575391246729828051534769056050117993691741808160658338919\
44097905646550003580396419617760070019081705875625972878012366674679\
89292342512250048487425437546815015357199615226915228311396456047784\
77865786121932224656620803395855540010187517792883116941453865099079\
74150833888842573806312810859398218917803622873821251271610417946046\
11112073413623443045273514117087181250817162098103623756234695368183\
13329645002476824195759384817251427574077359703488481671707390563123\
29489572243615805666703488202767581029534591772098940413043218935233\
90863085401224517659534671566110922764069066448030558833184012635309\
67680593942113177332551690381005762329396768070186525893092140568792\
80842586082843501854057032456535635578215444814064662533030013057318\
31828990009857758497553071226979016224998850520128961978321074633694\
72596920941676631197534864178982171496812717822255063769725399908483\
22022524322928978677204857120686863585204689473401131106177840652358\
37371614145085538055371518912620992897032714169143489868354092566936\
48812775003327275943150768750860164679671172857179015820811479158914\
84122397219348846942722048529334135862982966040930598883092702070297\
01021054310366548846285483053033629752307019066118768758621856563205\
61077991767594949026864421630980354592182135940618795388702995460780\
15244528568453393510567298824950473694304626782827092561451548697001\
58655671621640122290190060058355755003380914495159377600243284934348\
30830232919737374256470494422199465871651077565986045910787485717984\
28640454849863473825645873938305511859807699766243416400243821342352\
03754512014015085697869209666106500847867745391449969654486637962121\
96099816021654477490828443488872228300302385780318236684548929433912\
93263596581118329352900643557103898820061908796437986587441214584341\
11988752772398357121525143168135648985355335801715938044069098157455\
59287951656187194689448537276786455279919640158591032349477352564508\
41323717085310676425390779501257241835410944478913727273009603729220\
56811212360534469226325897656316758584230613651821772025561754082913\
7151881915423973881943439830877204381696

files.

Greetings
Bernd
-- 
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 22:11                           ` Bernd Eckenfels
@ 2003-08-28 23:00                             ` Jamie Lokier
  2003-08-29  0:47                             ` David Schwartz
  1 sibling, 0 replies; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 23:00 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel

Bernd Eckenfels wrote:
> Why is that? For a hash with n bits there are at least
> 2^y / 2^n = 2^(y-n) files with the same hash, if they have size y bits.
> Three are even more, if you consider all files up to this size.

Correct.  Now, can you tell me how many of the 2^(y-n) files are ones
you will find on the net or on anybody's hard disk?

I'll take a couple of guesses: either 0 or 1 :)

The strength of a cryptographic hash is due to the immense difficulty
of deliberately creating a file given a hash value.  Another strength
is the very low probability of finding two different files with the
same hash value.

They're a bit like uncomputable numbers.  (Just a little bit).  Lots
of colliding files exist.  But if you can find even one pair, that's
an amazing event.

Uncomputable numbers are similar.  There are infinitely more
uncomputable numbers than computable ones, yet it is impossible to
write one down or even refer to a specific one of them.

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file reading
  2003-08-28 21:59                             ` Bernd Eckenfels
@ 2003-08-28 23:02                               ` Jamie Lokier
  2003-08-28 23:44                               ` Lockless file readingu root
  1 sibling, 0 replies; 56+ messages in thread
From: Jamie Lokier @ 2003-08-28 23:02 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel

Bernd Eckenfels wrote:
> In article <200308281726.SAA24033@mauve.demon.co.uk> you wrote:
> > I'd be really surprised if there were that many pictures in the world.
> 
> Well, this is about probabilty. It does not mean that you need 2^64
> pictures, neighter does it mean you have a collision within 2^64 pictures.

It just means that if you have a collision with many fewer pictures
than that, it's such an unlikely event that a flaw in the program
calculating the hash, or a flaw in the hash algorithm itself, is more
likely than it being a random collision.

-- Jamie


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file readingu
  2003-08-28 21:59                             ` Bernd Eckenfels
  2003-08-28 23:02                               ` Jamie Lokier
@ 2003-08-28 23:44                               ` root
  2003-08-29 10:00                                 ` jlnance
  1 sibling, 1 reply; 56+ messages in thread
From: root @ 2003-08-28 23:44 UTC (permalink / raw)
  To: Bernd Eckenfels; +Cc: linux-kernel

> 
> In article <200308281726.SAA24033@mauve.demon.co.uk> you wrote:
> > I'd be really surprised if there were that many pictures in the world.
> 
> Well, this is about probabilty. It does not mean that you need 2^64
> pictures, neighter does it mean you have a collision within 2^64 pictures.

Of course it dowesn't.
The probability gets rather smaller as numbers go down, and bigger as
they go up.
With 2^128 bits, the chance of a a collision between 2^64 randomly chosen 
pictures is 50%.
At 2^54 pictures, it's about one in a million, and at 2^34 (enough for
several pictures of everyone alive) one in a billion billion.
At more common numbers of pictures (say 2^14) it becomes vanishingly
unlikely for anyone to have two matching pictures (even with several billion
archives)


^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file reading
  2003-08-28 22:11                           ` Bernd Eckenfels
  2003-08-28 23:00                             ` Jamie Lokier
@ 2003-08-29  0:47                             ` David Schwartz
  1 sibling, 0 replies; 56+ messages in thread
From: David Schwartz @ 2003-08-29  0:47 UTC (permalink / raw)
  To: Bernd Eckenfels, linux-kernel


> In article <MDEHLPKNGKAHNMBLJOLKEEEEFMAA.davids@webmaster.com> you wrote:

> > Find those GIFs, double-check, and publish immediately.
> > That would be
> > amazingly big news and would probably cause huge numbers of
> > people to switch
> > from MD5 to SHA1 overnight.

> Why is that?

	Because the security of a hash is based upon it being impractical to find
collisions. If you could demonstrate that you could find even a single
collision (by actually doing it), that would reduce confidence in MD5
massively.

> For a hash with n bits there are at least
> 2^y / 2^n = 2^(y-n) files with the same hash, if they have size y bits.
> Three are even more, if you consider all files up to this size.
>
> With n=128 (md5) and y=80000 (10k file) x=2^79872 different files
> with the same
> hash, which is:
[snip]

	So what? You can, in principle, break a 4096-bit RSA key just by brute
force factorization. The security is based upon the difficulty of actually
doing it in practice. The simplicity of doing it in theory is irrelevent.
Yes, in theory you can break any public key encryption scheme or any hash
digest algrithm by brute force trial and error. In practice, however your
children's, children's, children's, children would be long dead before you
got 1% of the way there.

	To date, no full MD5 collisions have been published and until earlier in
this thread, nobody even claimed to have one. And I'll bet $100 to $1 that
the claimant is in error, either because he used only part of the hash or a
hash of the hash, mis-implementing the hash, o actually hashed the same data
twice without realizing it.

	DS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file readingu
  2003-08-28 23:44                               ` Lockless file readingu root
@ 2003-08-29 10:00                                 ` jlnance
  2003-08-29 11:55                                   ` David Schwartz
  2003-08-29 15:43                                   ` William Lee Irwin III
  0 siblings, 2 replies; 56+ messages in thread
From: jlnance @ 2003-08-29 10:00 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 12:44:00AM +0100, root@mauve.demon.co.uk wrote:

> Of course it dowesn't.
> The probability gets rather smaller as numbers go down, and bigger as
> they go up.
> With 2^128 bits, the chance of a a collision between 2^64 randomly chosen 
> pictures is 50%.
> At 2^54 pictures, it's about one in a million, and at 2^34 (enough for
> several pictures of everyone alive) one in a billion billion.
> At more common numbers of pictures (say 2^14) it becomes vanishingly
> unlikely for anyone to have two matching pictures (even with several billion
> archives)

Be careful.  I remember discussing in probability class the liklyhood that
two people in a room with N people have the same birthday.  N does not have
to be anywhere close to 365 for your probability of a collision to be greater
than 50%.  I forget what the exact number is but its less than 30.  The
image problem sounds similar, depending on exactly how you phrase it.

Thanks,

Jim

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: Lockless file readingu
  2003-08-29 10:00                                 ` jlnance
@ 2003-08-29 11:55                                   ` David Schwartz
  2003-08-29 15:43                                   ` William Lee Irwin III
  1 sibling, 0 replies; 56+ messages in thread
From: David Schwartz @ 2003-08-29 11:55 UTC (permalink / raw)
  To: jlnance, root; +Cc: linux-kernel


> On Fri, Aug 29, 2003 at 12:44:00AM +0100, root@mauve.demon.co.uk wrote:

> > Of course it dowesn't.
> > The probability gets rather smaller as numbers go down, and bigger as
> > they go up.
> > With 2^128 bits, the chance of a a collision between 2^64
> > randomly chosen
> > pictures is 50%.
> > At 2^54 pictures, it's about one in a million, and at 2^34 (enough for
> > several pictures of everyone alive) one in a billion billion.
> > At more common numbers of pictures (say 2^14) it becomes vanishingly
> > unlikely for anyone to have two matching pictures (even with
> > several billion
> > archives)

> Be careful.  I remember discussing in probability class the liklyhood that
> two people in a room with N people have the same birthday.  N
> does not have
> to be anywhere close to 365 for your probability of a collision
> to be greater
> than 50%.  I forget what the exact number is but its less than 30.  The
> image problem sounds similar, depending on exactly how you phrase it.

	He's saying that with 2^128 possible hash values, you get a collision with
50% probability with roughly 2^64 pictures. Analogously for the birthday
paradox, with 365 possible birthdays (about 2^8), you get a collision with
50% probability with less than 30 people (about 2^5).

	Now does 5 really seem to be that much less than 8? His math is right and
so is yours. With 2^N possible values, all equally likely, you get a
collision with 50% probability in roughly 2^((N+1)/2).

	You'll note that the birthday paradox doesn't seem so suprising in
exponential notation. In the case of MD5, with
340,282,366,920,938,463,463,374,607,431,768,211,456 possible hashes, you get
a 50% chance of a collision in roughly
12,912,720,851,596,686,131 hashes. So that could seem fairly surprising when
you take it out of exponential notation too. (Just compare the lengths of
the two numbers.)

	DS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Lockless file readingu
  2003-08-29 10:00                                 ` jlnance
  2003-08-29 11:55                                   ` David Schwartz
@ 2003-08-29 15:43                                   ` William Lee Irwin III
  1 sibling, 0 replies; 56+ messages in thread
From: William Lee Irwin III @ 2003-08-29 15:43 UTC (permalink / raw)
  To: jlnance; +Cc: root, linux-kernel

On Fri, Aug 29, 2003 at 06:00:11AM -0400, jlnance@unity.ncsu.edu wrote:
> Be careful.  I remember discussing in probability class the liklyhood that
> two people in a room with N people have the same birthday.  N does not have
> to be anywhere close to 365 for your probability of a collision to be greater
> than 50%.  I forget what the exact number is but its less than 30.  The
> image problem sounds similar, depending on exactly how you phrase it.

This is very simple to see. Consider the probability of not having a
collision instead of having a collision. Each choice eliminates a
choice, so the probability of not colliding while taking a sample after
taking N samples from a uniform distribution over K values is (K-N)/K.
You end up getting 1 - K!/((K-N)! * K**N).

Given one possible value for each day of the year, K == 365.

So you want 1 - 365!/((365 - N)! * 365**N) > 1/2.

The first few probabilities of collision (calculated afresh) are:

0  0.0
1  0.0
2  2.739726027397249e-3
3  8.204165884781345e-3
4  1.6355912466550326e-2
5  2.713557369979358e-2
6  4.0462483649111536e-2
7  5.6235703095975365e-2
8  7.433529235166902e-2
9  9.462383388916673e-2
10 0.11694817771107757
11 0.14114137832173312
12 0.16702478883806438
13 0.19441027523242949
14 0.223102512004973
15 0.25290131976368646
16 0.2836040052528499
17 0.31500766529656066
18 0.34691141787178936
19 0.37911852603153673
20 0.41143838358058005
21 0.4436883351652058
22 0.4756953076625501
23 0.5072972343239854
24 0.5383442579145288
25 0.5686997039694639
26 0.598240820135939
27 0.626859282263242
28 0.6544614723423994
29 0.680968537477777
30 0.7063162427192686
31 0.7304546337286438
32 0.7533475278503207
33 0.774971854175772
34 0.7953168646201543
35 0.8143832388747152
36 0.8321821063798795
37 0.8487340082163846
38 0.8640678210821209
39 0.878219664366722
40 0.891231809817949
41 0.9031516114817354
42 0.9140304715618692
43 0.9239228556561199
44 0.9328853685514263
45 0.940975899465775
46 0.9482528433672547
47 0.9547744028332994
48 0.9605979728794224
49 0.9657796093226765
50 0.9703735795779884

N = 23 should do fine for your purposes.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2003-08-29 15:42 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-08-27 12:37 Lockless file reading Timo Sirainen
2003-08-27 12:42 ` Martin Konold
2003-08-27 12:52   ` Timo Sirainen
2003-08-27 13:40     ` Richard B. Johnson
2003-08-27 14:56       ` Timo Sirainen
2003-08-27 23:39         ` Jamie Lokier
2003-08-28  0:52           ` Timo Sirainen
2003-08-27 18:42             ` Nagendra Singh Tomar
2003-08-28  8:40               ` Timo Sirainen
2003-08-27 21:15                 ` Nagendra Singh Tomar
2003-08-28  9:35                   ` Timo Sirainen
2003-08-27 21:52                     ` Nagendra Singh Tomar
2003-08-28 13:26                       ` Matthias Andree
2003-08-28  9:17                 ` David Schwartz
2003-08-28  8:42               ` Valdis.Kletnieks
2003-08-28 20:13                 ` David B. Stevens
2003-08-28  1:50             ` Jamie Lokier
2003-08-28  3:17               ` Timo Sirainen
2003-08-28  6:01                 ` Valdis.Kletnieks
2003-08-28  6:13                 ` Jamie Lokier
2003-08-28  8:57                   ` Timo Sirainen
2003-08-28  9:56                     ` David Schwartz
2003-08-28 10:26                       ` Timo Sirainen
2003-08-27 22:58                         ` Nagendra Singh Tomar
2003-08-28 12:18                           ` Jamie Lokier
2003-08-28  0:39                             ` Nagendra Singh Tomar
2003-08-28 13:00                               ` Jamie Lokier
2003-08-28  1:06                                 ` Nagendra Singh Tomar
2003-08-28 21:49                             ` Bernd Eckenfels
2003-08-28 12:01                         ` Jamie Lokier
2003-08-28 13:28                           ` Timo Sirainen
2003-08-28 20:24                         ` David Schwartz
2003-08-28 12:44                       ` Ragnar Hojland Espinosa
2003-08-28 13:03                         ` Jamie Lokier
2003-08-28 17:26                           ` root
2003-08-28 17:35                             ` Jamie Lokier
2003-08-28 18:10                               ` root
2003-08-28 21:59                             ` Bernd Eckenfels
2003-08-28 23:02                               ` Jamie Lokier
2003-08-28 23:44                               ` Lockless file readingu root
2003-08-29 10:00                                 ` jlnance
2003-08-29 11:55                                   ` David Schwartz
2003-08-29 15:43                                   ` William Lee Irwin III
2003-08-28 20:37                         ` Lockless file reading David Schwartz
2003-08-28 22:11                           ` Bernd Eckenfels
2003-08-28 23:00                             ` Jamie Lokier
2003-08-29  0:47                             ` David Schwartz
2003-08-28  9:13                 ` Martin Konold
2003-08-28  9:27                   ` Timo Sirainen
2003-08-28  9:48                     ` Martin Konold
2003-08-28  0:03       ` Jamie Lokier
2003-08-28 12:08         ` Richard B. Johnson
2003-08-28 12:39           ` Jamie Lokier
2003-08-28 10:08   ` Matthias Andree
2003-08-28 10:54 ` Robin Rosenberg
2003-08-28 12:42   ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).