* sys_write() racy for multi-threaded append? @ 2007-03-08 23:08 Michael K. Edwards 2007-03-08 23:43 ` Eric Dumazet 2007-03-09 0:43 ` Alan Cox 0 siblings, 2 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-08 23:08 UTC (permalink / raw) To: Linux Kernel Mailing List

from sys_write():

	file = fget_light(fd, &fput_needed);
	if (file) {
		loff_t pos = file_pos_read(file);
		ret = vfs_write(file, buf, count, &pos);
		file_pos_write(file, pos);
		fput_light(file, fput_needed);
	}

Surely that's racy when two threads write to the same fd and the first vfs_write call blocks or is preempted. Or does fget_light take some per-file lock that I'm not seeing?

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-08 23:08 sys_write() racy for multi-threaded append? Michael K. Edwards @ 2007-03-08 23:43 ` Eric Dumazet 2007-03-08 23:57 ` Michael K. Edwards 2007-03-09 0:43 ` Alan Cox 1 sibling, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2007-03-08 23:43 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Linux Kernel Mailing List Michael K. Edwards a écrit : > from sys_write(): > > file = fget_light(fd, &fput_needed); > if (file) { > loff_t pos = file_pos_read(file); > ret = vfs_write(file, buf, count, &pos); > file_pos_write(file, pos); > fput_light(file, fput_needed); > } > > Surely that's racy when two threads write to the same fd and the first > vfs_write call blocks or is preempted. Or does fget_light take some > per-file lock that I'm not seeing? Nothing in the manuals says that write() on same fd should be non racy : In particular file pos might be undefined. There is a reason pwrite() exists. Kernel doesnt have to enforce thread safety as standard is quite clear. Only O_APPEND case is specially handled (and NFS might fail to handle this case correctly) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-08 23:43 ` Eric Dumazet @ 2007-03-08 23:57 ` Michael K. Edwards 2007-03-09 0:15 ` Eric Dumazet 0 siblings, 1 reply; 31+ messages in thread From: Michael K. Edwards @ 2007-03-08 23:57 UTC (permalink / raw) To: Eric Dumazet; +Cc: Linux Kernel Mailing List

On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Nothing in the manuals says that write() on same fd should be non racy : In
> particular file pos might be undefined. There is a reason pwrite() exists.
>
> Kernel doesnt have to enforce thread safety as standard is quite clear.

I know the standard _allows_ us to crash and burn (well, corrupt f_pos) when two threads race on an fd, but why would we want to? Wouldn't it be better to do something at least slightly sane, like atomically add to f_pos the expected number of bytes written, then do the write, then fix it up (again atomically) if vfs_write returns an unexpected pos?

> Only O_APPEND case is specially handled (and NFS might fail to handle this
> case correctly)

Is it? How?

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-08 23:57 ` Michael K. Edwards @ 2007-03-09 0:15 ` Eric Dumazet 2007-03-09 0:45 ` Michael K. Edwards 0 siblings, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2007-03-09 0:15 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Linux Kernel Mailing List

Michael K. Edwards a écrit :
> On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Nothing in the manuals says that write() on same fd should be non racy : In
>> particular file pos might be undefined. There is a reason pwrite() exists.
>>
>> Kernel doesnt have to enforce thread safety as standard is quite clear.
>
> I know the standard _allows_ us to crash and burn (well, corrupt
> f_pos) when two threads race on an fd, but why would we want to?
> Wouldn't it be better to do something at least slightly sane, like
> atomically add to f_pos the expected number of bytes written,
> then do the write, then fix it up (again atomically) if vfs_write
> returns an unexpected pos?

Absolutely not. We dont want to slow down kernel 'just in case a fool might want to do crazy things'

>> Only O_APPEND case is specially handled (and NFS might fail to handle
>> this case correctly)
>
> Is it? How?

mm/filemap.c generic_write_checks():

	if (file->f_flags & O_APPEND)
		*pos = i_size_read(inode);

done while inode is locked. O_APPEND basically says: just ignore fpos and always use the 'end of file'

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 0:15 ` Eric Dumazet @ 2007-03-09 0:45 ` Michael K. Edwards 2007-03-09 1:34 ` Benjamin LaHaise 2007-03-09 5:53 ` Eric Dumazet 0 siblings, 2 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-09 0:45 UTC (permalink / raw) To: Eric Dumazet; +Cc: Linux Kernel Mailing List On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote: > Absolutely not. We dont want to slow down kernel 'just in case a fool might > want to do crazy things' Actually, I think it would make the kernel (negligibly) faster to bump f_pos before the vfs_write() call. Unless fget_light sets fput_needed or the write doesn't complete cleanly, you won't have to touch the file table entry again after vfs_write() returns. You can adjust vfs_write to grab f_dentry out of the file before going into do_sync_write. do_sync_write is done with the struct file before it goes into the aio_write() loop. Result: you probably save at least an L1 cache miss, unless the aio_write loop is so frugal with L1 cache that it doesn't manage to evict the struct file. Patch to follow. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 0:45 ` Michael K. Edwards @ 2007-03-09 1:34 ` Benjamin LaHaise 2007-03-09 12:19 ` Michael K. Edwards 2007-03-09 5:53 ` Eric Dumazet 1 sibling, 1 reply; 31+ messages in thread From: Benjamin LaHaise @ 2007-03-09 1:34 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Eric Dumazet, Linux Kernel Mailing List > Actually, I think it would make the kernel (negligibly) faster to bump > f_pos before the vfs_write() call. Unless fget_light sets fput_needed > or the write doesn't complete cleanly, you won't have to touch the > file table entry again after vfs_write() returns. You can adjust Any number of things can cause a short write to occur, and rewinding the file position after the fact is just as bad. A sane app has to either serialise the writes itself or use a thread safe API like pwrite(). -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <zyntrop@kvack.org>. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 1:34 ` Benjamin LaHaise @ 2007-03-09 12:19 ` Michael K. Edwards 2007-03-09 13:44 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-09 12:19 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric Dumazet, Linux Kernel Mailing List On 3/8/07, Benjamin LaHaise <bcrl@kvack.org> wrote: > Any number of things can cause a short write to occur, and rewinding the > file position after the fact is just as bad. A sane app has to either > serialise the writes itself or use a thread safe API like pwrite(). Not on a pipe/FIFO. Short writes there are flat out verboten by 1003.1 unless O_NONBLOCK is set. (Not that f_pos is interesting on a pipe except as a "bytes sent" indicator -- and in the multi-threaded scenario, if you do the speculative update that I'm suggesting, you can't 100% trust it unless you ensure that you are not in mid-read/write in some other thread at the moment you sample f_pos. But that doesn't make it useless.) As to what a "sane app" has to do: it's just not that unusual to write application code that treats a short read/write as a catastrophic error, especially when the fd is of a type that is known never to produce a short read/write unless something is drastically wrong. For instance, I bomb on short write in audio applications where the driver is known to block until enough bytes have been read/written, period. When switching from reading a stream of audio frames from thread A to reading them from thread B, I may be willing to omit app serialization, because I can tolerate an imperfect hand-off in which thread A steals one last frame after thread B has started reading -- as long as the fd doesn't get screwed up. There is no reason for the generic sys_read code to leave a race open in which the same frame is read by both threads and a hardware buffer overrun results later. 
In short, I'm not proposing that the kernel perfectly serialize concurrent reads and writes to arbitrary fd types. I'm proposing that it not do something blatantly stupid and easily avoided in generic code that makes it impossible for any fd type to guarantee that, after 10 successful pipelined 100-byte reads or writes, f_pos will have advanced by 1000. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 12:19 ` Michael K. Edwards @ 2007-03-09 13:44 ` Eric Dumazet 2007-03-09 14:10 ` Alan Cox 2007-03-09 14:59 ` Benjamin LaHaise 2 siblings, 0 replies; 31+ messages in thread From: Eric Dumazet @ 2007-03-09 13:44 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Benjamin LaHaise, Linux Kernel Mailing List On Friday 09 March 2007 13:19, Michael K. Edwards wrote: > On 3/8/07, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Any number of things can cause a short write to occur, and rewinding the > > file position after the fact is just as bad. A sane app has to either > > serialise the writes itself or use a thread safe API like pwrite(). > > Not on a pipe/FIFO. Short writes there are flat out verboten by > 1003.1 unless O_NONBLOCK is set. (Not that f_pos is interesting on a > pipe except as a "bytes sent" indicator -- and in the multi-threaded > scenario, if you do the speculative update that I'm suggesting, you > can't 100% trust it unless you ensure that you are not in > mid-read/write in some other thread at the moment you sample f_pos. > But that doesn't make it useless.) Hello Michael When was the last time you checked standards ? Please read them again, and stop disinforming people. http://www.opengroup.org/onlinepubs/007908775/xsh/write.html "On a file not capable of seeking, writing always takes place starting at the current position. The value of a file offset associated with such a device is undefined." A pipe/FIFO is not capable of seeking. I let you make the conclusion of these two points. A conformant kernel is free to not touch f_pos for non capable seeking files (pipes, sockets, ...), or to put any value in it. Current code does that not because of lazy programmers, but because its generic, and adding special cases (tests + conditional branches) just slow down the code and make it larger. 
> > As to what a "sane app" has to do: it's just not that unusual to write > application code that treats a short read/write as a catastrophic > error, especially when the fd is of a type that is known never to > produce a short read/write unless something is drastically wrong. For > instance, I bomb on short write in audio applications where the driver > is known to block until enough bytes have been read/written, period. > When switching from reading a stream of audio frames from thread A to > reading them from thread B, I may be willing to omit app > serialization, because I can tolerate an imperfect hand-off in which > thread A steals one last frame after thread B has started reading -- > as long as the fd doesn't get screwed up. There is no reason for the > generic sys_read code to leave a race open in which the same frame is > read by both threads and a hardware buffer overrun results later. Don't assume your app is sane while the kernel is not. It's not very fair : Show us the source code so that we can agree with you or disagree. Also, I've seen some Unixes (namely AIX IBM) that could return a partial write even on a regular file on regular file system. An easy way to trigger this was to launch a debugger/syscall_tracer on the live process while it was doing a big write(). Most 'sane apps' were ignoring the partial return or just throw an exception. Even on 'cleaner Unixes', a write() near the ulimit -f may return a partial count on a regular file. > > In short, I'm not proposing that the kernel perfectly serialize > concurrent reads and writes to arbitrary fd types. I'm proposing that > it not do something blatantly stupid and easily avoided in generic > code that makes it impossible for any fd type to guarantee that, after > 10 successful pipelined 100-byte reads or writes, f_pos will have > advanced by 1000. 
> Before saying current linux code is "blatantly stupid and easily avoided", just post your patches so that we can check them and eventually say : Oh yes, Michael was right and {we|they} were "stupid" all these years Thank you ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 12:19 ` Michael K. Edwards 2007-03-09 13:44 ` Eric Dumazet @ 2007-03-09 14:10 ` Alan Cox 2007-03-09 14:59 ` Benjamin LaHaise 2 siblings, 0 replies; 31+ messages in thread From: Alan Cox @ 2007-03-09 14:10 UTC (permalink / raw) To: Michael K. Edwards Cc: Benjamin LaHaise, Eric Dumazet, Linux Kernel Mailing List > 1003.1 unless O_NONBLOCK is set. (Not that f_pos is interesting on a > pipe except as a "bytes sent" indicator -- and in the multi-threaded f_pos is undefined on a FIFO or similar object. > As to what a "sane app" has to do: it's just not that unusual to write > application code that treats a short read/write as a catastrophic > error, especially when the fd is of a type that is known never to > produce a short read/write unless something is drastically wrong. For If you are working in a strictly POSIX environment then a signal can interrupt almost any I/O as a short write even disk I/O. In the sane world the file I/O cases don't do this. > as long as the fd doesn't get screwed up. There is no reason for the > generic sys_read code to leave a race open in which the same frame is > read by both threads and a hardware buffer overrun results later. Audio devices are not seekable anyway. > concurrent reads and writes to arbitrary fd types. I'm proposing that > it not do something blatantly stupid and easily avoided in generic > code that makes it impossible for any fd type to guarantee that, after > 10 successful pipelined 100-byte reads or writes, f_pos will have > advanced by 1000. You might want to read up on the Unix design philosophy. Things like record based I/O are user space to avoid kernel complexity and also so that the overhead of these things is paid only by those who need them (its kind of RISC for OS design). Alan ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 12:19 ` Michael K. Edwards 2007-03-09 13:44 ` Eric Dumazet 2007-03-09 14:10 ` Alan Cox @ 2007-03-09 14:59 ` Benjamin LaHaise 2007-03-10 6:43 ` Michael K. Edwards 2 siblings, 1 reply; 31+ messages in thread From: Benjamin LaHaise @ 2007-03-09 14:59 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Eric Dumazet, Linux Kernel Mailing List On Fri, Mar 09, 2007 at 04:19:55AM -0800, Michael K. Edwards wrote: > On 3/8/07, Benjamin LaHaise <bcrl@kvack.org> wrote: > >Any number of things can cause a short write to occur, and rewinding the > >file position after the fact is just as bad. A sane app has to either > >serialise the writes itself or use a thread safe API like pwrite(). > > Not on a pipe/FIFO. Short writes there are flat out verboten by > 1003.1 unless O_NONBLOCK is set. (Not that f_pos is interesting on a > pipe except as a "bytes sent" indicator -- and in the multi-threaded > scenario, if you do the speculative update that I'm suggesting, you > can't 100% trust it unless you ensure that you are not in > mid-read/write in some other thread at the moment you sample f_pos. > But that doesn't make it useless.) Writes to a pipe/FIFO are atomic, so long as they fit within the pipe buffer size, while f_pos on a pipe is undefined -- what exactly is the issue here? The semantics you're assuming are not defined by POSIX. Heck, even looking at a man page for one of the *BSDs states "Some devices are incapable of seeking. The value of the pointer associated with such a device is undefined." What part of undefined is problematic? > As to what a "sane app" has to do: it's just not that unusual to write > application code that treats a short read/write as a catastrophic > error, especially when the fd is of a type that is known never to > produce a short read/write unless something is drastically wrong. 
For > instance, I bomb on short write in audio applications where the driver > is known to block until enough bytes have been read/written, period. > When switching from reading a stream of audio frames from thread A to > reading them from thread B, I may be willing to omit app > serialization, because I can tolerate an imperfect hand-off in which > thread A steals one last frame after thread B has started reading -- > as long as the fd doesn't get screwed up. There is no reason for the > generic sys_read code to leave a race open in which the same frame is > read by both threads and a hardware buffer overrun results later. I hope I don't have to run any of your software. Short writes can and do happen because of a variety of reasons: signals, memory allocation failures, quota being exceeded.... These are all error conditions the kernel has to provide well defined semantics for, as well behaved applications will try to handle them gracefully. > In short, I'm not proposing that the kernel perfectly serialize > concurrent reads and writes to arbitrary fd types. I'm proposing that > it not do something blatantly stupid and easily avoided in generic > code that makes it impossible for any fd type to guarantee that, after > 10 successful pipelined 100-byte reads or writes, f_pos will have > advanced by 1000. The semantics you're looking for are defined for regular files with O_APPEND. Anything else is asking for synchronization that other applications do not require and do not desire. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <zyntrop@kvack.org>. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 14:59 ` Benjamin LaHaise @ 2007-03-10 6:43 ` Michael K. Edwards 0 siblings, 0 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-10 6:43 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric Dumazet, Linux Kernel Mailing List I apologize for throwing around words like "stupid". Whether or not the current semantics can be improved, that's not a constructive way to characterize them. I'm sorry. As three people have ably pointed out :-), the particular case of a pipe/FIFO isn't seekable and doesn't need the f_pos member anyway (it's effectively always O_APPEND). That's what I get for checking against standards documents at 3AM. Of course, this has nothing to do with the point that led me to comment on pipes/FIFOs (which was that there exist file types that never return 0<ret<nbytes). And it was in the context of a very explicit aside that f_pos is not _interesting_ on a pipe/FIFO, except as an indicator of total bytes written. You could only peek at this with an (admittedly non-portable) llseek(fd, 0, SEEK_CUR) anyway -- which you would only do for diagnostic purposes. But diagnosis of odd corner cases (rarely in my code, usually in other people's) is what I do day in and day out, so for me it would be worth having. In any case, you're all right that the standard doesn't require you to do anything useful with f_pos on a pipe/FIFO. But you're permitted to make it useful if you want to: <1003.1 lseek()> The behavior of lseek() on devices which are incapable of seeking is implementation-defined. The value of the file offset associated with such a device is undefined. </1003.1> Tracking f_pos accurately when writes from multiple threads hit the same fd (pipe or not) isn't portable, but I recall situations where it would have been useful. 
And if f_pos has to be kept at all in the uncontended case, it costs you little or nothing to do it in a thread-safe manner -- as long as you don't overconstrain the semantics such that you forbid the transient overshoot associated with a short write. In fact, unless there's something I've missed, increasing f_pos before entering vfs_write() happens to be _faster_ than the current code for common load patterns, both single- and multi-threaded (although getting the full benefit in the multi-threaded case will take some fiddling with f_count placement). I say it costs "little or nothing" only because altering an loff_t atomically is not free. But even on x86, with its inability to atomically modify any 64-bit entity in memory, an uncontended spinlock on a cacheline already in L1 is so cheap that making the f_pos changes atomic will (I think) be lost in the noise. In any case, rewriting read_write.c is proving interesting. I'll let you all know if anything comes of it. In the meantime, thanks for your (really quite friendly under the circumstances) comments. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 0:45 ` Michael K. Edwards 2007-03-09 1:34 ` Benjamin LaHaise @ 2007-03-09 5:53 ` Eric Dumazet 2007-03-09 11:52 ` Michael K. Edwards 1 sibling, 1 reply; 31+ messages in thread From: Eric Dumazet @ 2007-03-09 5:53 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Linux Kernel Mailing List Michael K. Edwards a écrit : > On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote: >> Absolutely not. We dont want to slow down kernel 'just in case a fool >> might >> want to do crazy things' > > Actually, I think it would make the kernel (negligibly) faster to bump > f_pos before the vfs_write() call. Unless fget_light sets fput_needed > or the write doesn't complete cleanly, you won't have to touch the > file table entry again after vfs_write() returns. You can adjust > vfs_write to grab f_dentry out of the file before going into > do_sync_write. do_sync_write is done with the struct file before it > goes into the aio_write() loop. Result: you probably save at least an > L1 cache miss, unless the aio_write loop is so frugal with L1 cache > that it doesn't manage to evict the struct file. > > Patch to follow. Dont even try, you *cannot* do that, without breaking the standards, or without a performance drop. The only safe way would be to lock the file during the whole read()/write() syscall, and we dont want this (this would be more expensive than current) Dont forget 'file' may be some sockets/tty/whatever, not a regular file. Standards are saying : If an error occurs, file pointer remains unchanged. You cannot know for sure how many bytes will be written, since write() can returns a count that is different than buflen. So you cannot update fpos before calling vfs_write() About your L1 'miss', dont forget that multi-threaded apps are going to atomic_dec_and_test(&file->f_count) anyway when fput() is done at the end of syscall. And you were concerned about multi-threaded apps, didnt you ? 
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-09 5:53 ` Eric Dumazet @ 2007-03-09 11:52 ` Michael K. Edwards 0 siblings, 0 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-09 11:52 UTC (permalink / raw) To: Eric Dumazet; +Cc: Linux Kernel Mailing List

On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Dont even try, you *cannot* do that, without breaking the standards, or
> without a performance drop.
>
> The only safe way would be to lock the file during the whole read()/write()
> syscall, and we dont want this (this would be more expensive than current)
> Dont forget 'file' may be some sockets/tty/whatever, not a regular file.

I'm not trying to provide full serialization of concurrent multi-threaded read()/write() in all exceptional scenarios. I am trying to think through the semantics of pipelined I/O operations on struct file. In the _absence_ of an exception, something sane should happen -- and when you start at f_pos == 1000 and write 100 bytes each from two threads (successfully), leaving f_pos at 1100 is not sane.

> Standards are saying :
>
> If an error occurs, file pointer remains unchanged.

From 1003.1, 2004 edition:

<standard>
This volume of IEEE Std 1003.1-2001 does not specify the value of the file offset after an error is returned; there are too many cases. For programming errors, such as [EBADF], the concept is meaningless since no file is involved. For errors that are detected immediately, such as [EAGAIN], clearly the pointer should not change. After an interrupt or hardware error, however, an updated value would be very useful and is the behavior of many implementations. This volume of IEEE Std 1003.1-2001 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.
</standard>

The effect on f_pos of an error during concurrent writes is therefore doubly unconstrained.
In the absence of concurrent writes, it is quite harmless for f_pos to have transiently contained, at some point during the execution of write(), an overestimate of the file position after the write(). In the presence of concurrent writes (let us say two 100-byte writes to a file whose f_pos starts at 1000), it is conceivable that the second write will succeed at f_pos == 1100 but the first will be short (let us say only 50 bytes are written), leaving f_pos at 1150 and no bytes written in the range 1050 to 1099. That would suck -- but the standard does not oblige you to avoid it unless the destination is a pipe or FIFO with O_NONBLOCK clear, in which case partial writes are flat out verboten. > You cannot know for sure how many bytes will be written, since write() can > returns a count that is different than buflen. Of course (except in the pipe/FIFO case). But if it does, and you're writing concurrently to the fd, you're not obliged to do anything sane at all. If you're not writing concurrently, the fact that you overshot and then fixed it up after vfs_write() returned is _totally_ invisible. f_pos is local to the struct file. > So you cannot update fpos before calling vfs_write() You can speculatively update it, and in the common case you don't have to touch it again. That's a win. > About your L1 'miss', dont forget that multi-threaded apps are going to > atomic_dec_and_test(&file->f_count) anyway when fput() is done at the end of > syscall. And you were concerned about multi-threaded apps, didnt you ? That does indeed interfere with the optimization for multi-threaded apps. Which doesn't mean it isn't worth having for single-threaded apps. And if we get to the point that that atomic_dec_and_test is the only thing left (in the common case) that touches the struct file after a VFS operation completes, then we can evaluate whether f_count ought to be split out of the struct file and put somewhere else. 
In fact, if I understand the calls inside vfs_write() correctly, f_count is (usually?) the only member of struct file that is written to during a call to sys_pwrite64(); so moving it out of struct file and into some place where it has to be kept cache-coherent anyway would also improve the performance on SMP of distributed pwrite() calls to a process-global fd. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-08 23:08 sys_write() racy for multi-threaded append? Michael K. Edwards 2007-03-08 23:43 ` Eric Dumazet @ 2007-03-09 0:43 ` Alan Cox 1 sibling, 0 replies; 31+ messages in thread From: Alan Cox @ 2007-03-09 0:43 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Linux Kernel Mailing List > Surely that's racy when two threads write to the same fd and the first > vfs_write call blocks or is preempted. Or does fget_light take some > per-file lock that I'm not seeing? I think you are making assumptions about file position behaviour that are simply not guaranteed in the standards document. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? [not found] ` <7WBt7-3SZ-23@gated-at.bofh.it> @ 2007-03-12 7:53 ` Bodo Eggert 2007-03-12 16:26 ` Michael K. Edwards 0 siblings, 1 reply; 31+ messages in thread From: Bodo Eggert @ 2007-03-12 7:53 UTC (permalink / raw) To: Michael K. Edwards, Eric Dumazet, Linux Kernel Mailing List

Michael K. Edwards <medwards.linux@gmail.com> wrote:
> On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Absolutely not. We dont want to slow down kernel 'just in case a fool might
>> want to do crazy things'
>
> Actually, I think it would make the kernel (negligibly) faster to bump
> f_pos before the vfs_write() call.

This is a security risk.

----------------
other process:
	unlink(secret_file)

Thread 1:
	write(fd, large)
	(interrupted)

Thread 2:
	fseek(fd, -n, relative)
	read(fd, buf)
----------------

BTW: The best thing you can do to a program where two threads race for writing one fd is to let it crash and burn in the most spectacular way allowed without affecting the rest of the system, unless it happens to be a pipe and the number of bytes written is less than PIPE_BUF.
-- 
The secret of the universe is #@*%! NO CARRIER

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-12 7:53 ` Bodo Eggert @ 2007-03-12 16:26 ` Michael K. Edwards 2007-03-12 18:48 ` Bodo Eggert 0 siblings, 1 reply; 31+ messages in thread From: Michael K. Edwards @ 2007-03-12 16:26 UTC (permalink / raw) To: 7eggert; +Cc: Eric Dumazet, Linux Kernel Mailing List

On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote:
> Michael K. Edwards <medwards.linux@gmail.com> wrote:
> > Actually, I think it would make the kernel (negligibly) faster to bump
> > f_pos before the vfs_write() call.
>
> This is a security risk.
>
> ----------------
> other process:
> 	unlink(secret_file)
>
> Thread 1:
> 	write(fd, large)
> 	(interrupted)
>
> Thread 2:
> 	fseek(fd, -n, relative)
> 	read(fd, buf)
> ----------------

I don't entirely follow this. Which process is supposed to be secure here, and what information is supposed to be secret from whom? But in any case, are you aware that f_pos is clamped to the inode's max_bytes, not to the current file size? Thread 2 can seek/read past the position at which Thread 1's write began, whether f_pos is bumped before or after the vfs_write.

> BTW: The best thing you can do to a program where two threads race for
> writing one fd is to let it crash and burn in the most spectacular way
> allowed without affecting the rest of the system, unless it happens to
> be a pipe and the number of bytes written is less than PIPE_BUF.

That's fine when you're doing integration test, and should probably be the default during development. But if the race is first exposed in the field, or if the developer is trying to concentrate on a different problem, "spectacular crash and burn" may do more harm than good. 
It's easy enough to refactor the f_pos handling in the kernel so that it all goes through three or four inline accessor functions, at which point you can choose your trade-off between speed and idiot-proofness -- at _kernel_ compile time, or (given future hardware that supports standardized optionally-atomic-based-on-runtime-flag operations) per process at run-time. Frankly, I think that unless application programmers poke at some sort of magic "I promise to handle short writes correctly" bit, write() should always return either the full number of bytes requested or an error code. If they do poke at that bit, the (development) kernel should deliberately split writes a few percent of the time, just to exercise the short-write code paths. And in order to find out where that magic bit is, they should have to read the kernel code or ask on LKML (and get the "standard lecture"). Really very IEEE754-like, actually. (Harp, harp.) Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-12 16:26 ` Michael K. Edwards @ 2007-03-12 18:48 ` Bodo Eggert 2007-03-13 0:46 ` Michael K. Edwards 0 siblings, 1 reply; 31+ messages in thread From: Bodo Eggert @ 2007-03-12 18:48 UTC (permalink / raw) To: Michael K. Edwards; +Cc: 7eggert, Eric Dumazet, Linux Kernel Mailing List On Mon, 12 Mar 2007, Michael K. Edwards wrote: > On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote: > > Michael K. Edwards <medwards.linux@gmail.com> wrote: > > > Actually, I think it would make the kernel (negligibly) faster to bump > > > f_pos before the vfs_write() call. > > > > This is a security risk. <snip> > I don't entirely follow this. I got it wrong, and I didn't CC myself so I couldn't correct it in time. If it was a security risk, it would exist right now, too, because of other access patterns. > > BTW: The best thing you can do to a program where two threads race for > > writing one fd is to let it crash and burn in the most spectacular way > > allowed without affecting the rest of the system, unless it happens to > > be a pipe and the number of bytes written is less than PIPE_MAX. > > That's fine when you're doing integration test, and should probably be > the default during development. But if the race is first exposed in > the field, or if the developer is trying to concentrate on a different > problem, "spectacular crash and burn" may do more harm than good. > It's easy enough to refactor the f_pos handling in the kernel so that > it all goes through three or four inline accessor functions, at which > point you can choose your trade-off between speed and idiot-proofness > -- at _kernel_ compile time, or (given future hardware that supports > standardized optionally-atomic-based-on-runtime-flag operations) per > process at run-time. 
CONFIG_WOMBAT Waste memory, brain and time in order to grant an atomic write which is neither guaranteed by the standard nor expected by any sane programmer, just in case some idiot tries to write to one file from multiple processes. Warning: Programs expecting this behaviour are buggy and non-portable. > Frankly, I think that unless application programmers poke at some sort > of magic "I promise to handle short writes correctly" bit, write() > should always return either the full number of bytes requested or an > error code. If you assume that you won't have short writes, your programs may fail on e.g. Solaris. There may be reasons for Linux to use the same semantics at some time in the future, you never know. If you assume you *may* have short writes, you have no problem. > If they do poke at that bit, the (development) kernel > should deliberately split writes a few percent of the time, just to > exercise the short-write code paths. And in order to find out where > that magic bit is, they should have to read the kernel code or ask on > LKML (and get the "standard lecture"). I think it may be well-hidden in menuconfig. > Really very IEEE754-like, actually. (Harp, harp.) -EYOULOSTME -- "Try to look unimportant; they may be low on ammo." -Infantry Journal ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-12 18:48 ` Bodo Eggert @ 2007-03-13 0:46 ` Michael K. Edwards 2007-03-13 2:24 ` Alan Cox 0 siblings, 1 reply; 31+ messages in thread From: Michael K. Edwards @ 2007-03-13 0:46 UTC (permalink / raw) To: Bodo Eggert; +Cc: Eric Dumazet, Linux Kernel Mailing List On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote: > On Mon, 12 Mar 2007, Michael K. Edwards wrote: > > That's fine when you're doing integration test, and should probably be > > the default during development. But if the race is first exposed in > > the field, or if the developer is trying to concentrate on a different > > problem, "spectacular crash and burn" may do more harm than good. > > It's easy enough to refactor the f_pos handling in the kernel so that > > it all goes through three or four inline accessor functions, at which > > point you can choose your trade-off between speed and idiot-proofness > > -- at _kernel_ compile time, or (given future hardware that supports > > standardized optionally-atomic-based-on-runtime-flag operations) per > > process at run-time. > > CONFIG_WOMBAT > > Waste memory, brain and time in order to grant an atomic write which is > neither guaranteed by the standard nor expected by any sane programmer, > just in case some idiot tries to write to one file from multiple > processes. > > Warning: Programs expecting this behaviour are buggy and non-portable. OK, I laughed out loud at this. But I think you're missing my point, which is that there's a time to be hard-core about code quality and there's a time to be hard-core about _product_ quality. Face it, all products containing software more or less suck. This is because most programmers write crap code most of the time. The only way to cope with this, outside the confines of the European defense industry and other niches insulated from economic reality, is to make the production environment gentler on _application_ code than the development environment is. 
Hence CONFIG_WOMBAT. (I like that name. I'm going to use it in my patch, with your permission. :-) Writing to a file from multiple processes is not usually the problem. Writing to a common "struct file" from multiple threads is. 99.999% of the time it will work, because you're only writing as far as VFS cache and then bumping f_pos, and your threads are probably on the same processor anyway. 0.001% of the time the second thread will see a stale f_pos and clobber the first write. This is true even on file types that can never return a short write. If you remember to open with O_APPEND so the pos argument to vfs_write is silently ignored, or if the implementation underlying vfs_write effectively ignores the pos argument irrespective of flags, you're OK. If the pos argument isn't ignored, or if you ever look at the result of a relative seek on any fd that maps to that struct file, you're screwed. (Note to the alert reader: yes, this means shell scripts should always use >> rather than > when routing stdout and/or stderr to a file. You're just as vulnerable to interleaving due to stdio buffering issues as you are when stdio and stderr are sent to the tty, and short writes may still be a problem if you are so foolish as to use a filesystem that generates them on anything short of a catastrophic error, but at least you get O_APPEND and sane behavior on ftruncate().) > > Frankly, I think that unless application programmers poke at some sort > > of magic "I promise to handle short writes correctly" bit, write() > > should always return either the full number of bytes requested or an > > error code. > > If you asume that you won't have short writes, your programs may fail on > e.g. solaris. There may be reasons for linux to use the same semantics at > some time in the future, you never know. So what? My products are shipping _now_. Future kernels are guaranteed to break them anyway because sysfs is a moving target. 
Solaris is so not in the game for my kind of embedded work, it's not even funny. If POSIX mandates stupid shit, and application programmers don't read that part of the manual anyway (and don't code on that assumption in practice), to hell with POSIX. On many file descriptors, short writes simply can't happen -- and code that purports to handle short writes but has never been exercised is arguably worse than code that simply bombs on short write. So if I can't shim in an induce-short-writes-randomly-on-purpose mechanism during development, I don't want short writes in production, period. In my world, GNU/Linux is not a crappy imitation Solaris that you get to pay out the wazoo for to Red Hat (and get no documentation and lousy tech support that doesn't even cover your hardware). It's a full-source-code platform on which you can engineer robust industrial and consumer products, because you can control the freeze and release schedule component-by-component, and you can point fix anything in the system at any time. If, that is, you understand that the source code is not the software, and that you can't retrofit stability and security overnight onto code that was written with no thought of anything but performance. > If you asume you *may* have short writes, you have no problem. Sure -- until the one code path in a hundred that handles the "short write" case incorrectly gets traversed in production, after having gone untested in a development environment that used a different filesystem that never happened to trigger it. > > If they do poke at that bit, the (development) kernel > > should deliberately split writes a few percent of the time, just to > > exercise the short-write code paths. And in order to find out where > > that magic bit is, they should have to read the kernel code or ask on > > LKML (and get the "standard lecture"). > > I think it may be well-hidden in menuconfig. > > > Really very IEEE754-like, actually. (Harp, harp.) 
> > -EYOULOSTME I have been harping in other threads on the desirability of an AIO model that resembles IEEE754 floating point in some key respects that would make it practical to implement AEIOUs (asynchronously executed I/O units) in future hardware. That means revamping the calling conventions and exception semantics to reflect what latitude the implementors of pipelined AIO coprocessors are likely to need, not the designed-by-committee, implemented-by-accretion POSIX heritage. My advocacy in mere human language has so far resulted in -ENOPATCH, so I'm having the alternate joy and misery of rewriting fs/read_write.c and all its friends. I'm learning quite a lot about Linux internals in the process, too -- at multiple levels, coding and design and community dynamics. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 0:46 ` Michael K. Edwards @ 2007-03-13 2:24 ` Alan Cox 2007-03-13 7:25 ` Michael K. Edwards 2007-03-13 14:00 ` David M. Lloyd 0 siblings, 2 replies; 31+ messages in thread From: Alan Cox @ 2007-03-13 2:24 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List > Writing to a file from multiple processes is not usually the problem. > Writing to a common "struct file" from multiple threads is. Not normally because POSIX sensibly invented pread/pwrite. Forgot preadv/pwritev but they did the basics and end of problem > So what? My products are shipping _now_. That doesn't inspire confidence. > even funny. If POSIX mandates stupid shit, and application > programmers don't read that part of the manual anyway (and don't code > on that assumption in practice), to hell with POSIX. On many file Thats funny, you were talking about quality a moment ago. > descriptors, short writes simply can't happen -- and code that There is almost no descriptor this is true for. Any file I/O can and will end up short on disk full or resource limit exceeded or quota exceeded or NFS server exploded or ... And on the device side about the only thing with the vaguest guarantees is pipe(). > purports to handle short writes but has never been exercised is > arguably worse than code that simply bombs on short write. So if I > can't shim in an induce-short-writes-randomly-on-purpose mechanism > during development, I don't want short writes in production, period. Easy enough to do and gcov plus dejagnu or similar tools will let you coverage analyse the resulting test set and replay it. > Sure -- until the one code path in a hundred that handles the "short > write" case incorrectly gets traversed in production, after having > gone untested in a development environment that used a different > filesystem that never happened to trigger it. 
Competent QA and testing people test all the returns in the manual as well as all the returns they can find in the code. See ptrace(2) if you don't want to do a lot of relinking and strace for some useful worked examples of syscall hooking. Alan ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 2:24 ` Alan Cox @ 2007-03-13 7:25 ` Michael K. Edwards 2007-03-13 7:42 ` David Miller 2007-03-13 13:15 ` Alan Cox 2007-03-13 14:00 ` David M. Lloyd 1 sibling, 2 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-13 7:25 UTC (permalink / raw) To: Alan Cox; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List On 3/12/07, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > > Writing to a file from multiple processes is not usually the problem. > > Writing to a common "struct file" from multiple threads is. > > Not normally because POSIX sensibly invented pread/pwrite. Forgot > preadv/pwritev but they did the basics and end of problem pread/pwrite address a minuscule fraction of lseek+read(v)/write(v) use cases -- a fraction that someone cared about strongly enough to get into X/Open CAE Spec Issue 5 Version 2 (1997), from which it propagated into UNIX98 and thence into POSIX.1-2001. The fact that no one has bothered to implement preadv/pwritev in the decade since pread/pwrite entered the Single UNIX standard reflects the rarity with which they appear in general code. Life is too short to spend it rewriting application code that uses readv/writev systematically, especially when that code is going to ship inside a widget whose kernel you control. > > So what? My products are shipping _now_. > > That doesn't inspire confidence. Oh, please. Like _your_ employer is the poster child for code quality. The cheap shot is also irrelevant to the point that I was making, which is that sometimes portability simply doesn't matter and the right thing to do is to firm up the semantics of the filesystem primitives from underneath. > > even funny. If POSIX mandates stupid shit, and application > > programmers don't read that part of the manual anyway (and don't code > > on that assumption in practice), to hell with POSIX. On many file > > Thats funny, you were talking about quality a moment ago. 
Quality means the devices you ship now keep working in the field, and the probable cost of later rework if the requirements change does not exceed the opportunity cost of over-engineering up front. Economy gets a look-in too, and says that it's pointless to delay shipment and bloat the application coding for cases that can't happen. If POSIX says that any and all writes (except small pipe/FIFO writes, whatever) can return a short byte count -- but you know damn well you're writing to a device driver that never, ever writes short, and if it did you'd miss a timing budget recovering from it anyway -- to hell with POSIX. And if you want to build a test jig for this code that uses pipes or dummy files in place of the device driver, that test jig should never, ever write short either. > > descriptors, short writes simply can't happen -- and code that > > There is almost no descriptor this is true for. Any file I/O can and will > end up short on disk full or resource limit exceeded or quota exceeded or > NFS server exploded or ... Not on a properly engineered widget, it won't. And if it does, and the application isn't coded to cope in some way totally different from an infinite retry loop, then you might as well signal the exception condition using whatever mechanism is appropriate to the API (-EWHATEVER, SIGCRISIS, or block until some other process makes room). And in any case files on disk are the least interesting kind of file descriptor in an embedded scenario -- devices and pipes and pollfds and netlink sockets are far more frequent read/write targets. > And on the device side about the only thing with the vaguest guarantees > is pipe(). Guaranteed by the standard, sure. Guaranteed by the implementation, as long as you write in the size blocks that the device is expecting? Lots of devices -- ALSA's OSS PCM emulation, most AF_LOCAL and AF_NETLINK sockets, almost any "character" device with a record-structured format. 
A short write to any of these almost certainly means the framing is screwed and you need to close and reopen the device. Not all of these are exclusively O_APPEND situations, and there's no reason on earth not to thread-safe the f_pos handling so that an application and filesystem/driver can agree on useful lseek() semantics. > > purports to handle short writes but has never been exercised is > > arguably worse than code that simply bombs on short write. So if I > > can't shim in an induce-short-writes-randomly-on-purpose mechanism > > during development, I don't want short writes in production, period. > > Easy enough to do and gcov plus dejagnu or similar tools will let you > coverage analyse the resulting test set and replay it. Here we agree. Except that I've rarely seen embedded application code that wouldn't explode in my face if I tried it. Databases yes, and the better class of mail and web servers, and relatively mature scripting languages and bytecode interpreters; but the vast majority of working programmers in these latter days do not exercise this level of discipline. > > Sure -- until the one code path in a hundred that handles the "short > > write" case incorrectly gets traversed in production, after having > > gone untested in a development environment that used a different > > filesystem that never happened to trigger it. > > Competent QA and testing people test all the returns in the manual as > well as all the returns they can find in the code. See ptrace(2) if you > don't want to do a lot of relinking and strace for some useful worked > examples of syscall hooking. Even in the "enterprise" space, most of the QA and testing people I have dealt with couldn't hook a syscall if their children were starving and the fish were only biting on syscalls. The embedded situation is even worse. ltrace didn't work on ARM for years and hardly anyone _noticed_, let alone did anything about it. 
(I posted a fix to ltrace-devel a month ago, but evidently the few fish left in that river don't bite on hooked syscalls.) strace maintenance doesn't seem too healthy either, judging by what I had to do to it in order to get it to recognize ALSA ioctls and the quasi-syscalls involved in ARM TLS. But on that note -- do you have any idea how one might get ltrace to work on a multi-threaded program, or how one might enhance it to instrument function calls from one shared library to another? Or better yet, can you advise me on how to induce gdbserver to stream traces of library/syscall entry/exits for all the threads in a process? And then how to cram it down into the kernel so I don't take the hit for an MMU context switch every time I hit a syscall or breakpoint in the process under test? That would be a really useful tool for failure analysis in embedded Linux, doubly so on multi-core chips, especially if it could be made minimally intrusive on the CPU(s) where the application is running. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 7:25 ` Michael K. Edwards @ 2007-03-13 7:42 ` David Miller 2007-03-13 16:24 ` Michael K. Edwards 2007-03-13 13:15 ` Alan Cox 1 sibling, 1 reply; 31+ messages in thread From: David Miller @ 2007-03-13 7:42 UTC (permalink / raw) To: medwards.linux; +Cc: alan, 7eggert, dada1, linux-kernel From: "Michael K. Edwards" <medwards.linux@gmail.com> Date: Mon, 12 Mar 2007 23:25:48 -0800 > Quality means the devices you ship now keep working in the field, and > the probable cost of later rework if the requirements change does not > exceed the opportunity cost of over-engineering up front. Economy > gets a look-in too, and says that it's pointless to delay shipment and > bloat the application coding for cases that can't happen. If POSIX > says that any and all writes (except small pipe/FIFO writes, whatever) > can return a short byte count -- but you know damn well you're writing > to a device driver that never, ever writes short, and if it did you'd > miss a timing budget recovering from it anyway -- to hell with POSIX. You're not even safe over standard output, simply run the program over ssh and you suddenly have socket semantics to deal with. In the early days the fun game to play was to run programs over rsh to see in what amusing way they would explode. ssh has replaced rsh in this game, but the bugs have largely stayed the same. Even early versions of tar used to explode on TCP half-closes and whatnot. In short, if you don't handle short writes, you're writing a program for something other than unix. We're not changing write() to interlock with other parallel callers or messing with the f_pos semantics in such cases, that's stupid, please cope, kthx. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 7:42 ` David Miller @ 2007-03-13 16:24 ` Michael K. Edwards 2007-03-13 17:59 ` Michael K. Edwards 0 siblings, 1 reply; 31+ messages in thread From: Michael K. Edwards @ 2007-03-13 16:24 UTC (permalink / raw) To: David Miller; +Cc: alan, 7eggert, dada1, linux-kernel On 3/13/07, David Miller <davem@davemloft.net> wrote: > You're not even safe over standard output, simply run the program over > ssh and you suddenly have socket semantics to deal with. I'm intimately familiar with this one. Easily worked around by piping the output through cat or tee. Not that one should ever write code for a *nix box that can't cope with stdout being a socket or tty; but sometimes the quickest way to glue existing code into a pipeline is to pass /dev/stdout in place of a filename, and there's no real value in reworking legacy code to handle short writes when you can just drop in cat. > In short, if you don't handle short writes, you're writing a program > for something other than unix. Right in one. You're writing a program for a controlled environment, whether it's a Linux-only API (netlink sockets) or a Linux-only embedded product. And there's no need for that controlled environment to gratuitously reproduce the overly vague semantics of the *nix zoo. > We're not changing write() to interlock with other parallel callers or > messing with the f_pos semantics in such cases, that's stupid, please > cope, kthx. This is not what I am proposing, and certainly not what I'm implementing. Right now f_pos handling is gratuitously thread-unsafe even in the common, all writes completed normally case. writev() opens the largest possible window to f_pos corruption by delaying the f_pos update until after the vfs_writev() completes. That's the main thing I want to see fixed. 
_If_ it proves practical to wrap accessor functions around f_pos accesses, and _if_ accesses to f_pos from outside read_write.c can be made robust against transient overshoot, then there is no harm in tightening up f_pos semantics so that successful pipelined writes to the same file don't collide so easily. If read-modify-write on f_pos were atomic, it would be easy to guarantee that it is in a sane state any time a syscall is not actually outstanding on that struct file. There's even a potential performance gain from streamlining the pattern of L1 cache usage in the common case, although I expect it's negligible. Making read-modify-write atomic on f_pos is of course not free on all platforms. But once you have the accessors, you can decide at kernel compile time whether or not to interlock them with spinlocks and memory barriers or some such. Most platforms probably needn't bother; there are much bigger time bombs lurking elsewhere in the kernel. x86 is a bit dicey, because there's no way to atomically update a 64-bit loff_t in memory, and so in principle you could get a race on carry across the 32-bit boundary. (This file instantaneously jumped from 4GB to 8GB in size? What the hell? How long would it take you to spot a preemption between writing out the upper and lower halves of a 64-bit integer?) In any case, I have certainly gotten the -ENOPATCH message by now. It must be nice not to have to care whether things work in the field if you can point at something stupid that an application programmer did. I don't always have that luxury. Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 16:24 ` Michael K. Edwards @ 2007-03-13 17:59 ` Michael K. Edwards 2007-03-13 19:09 ` Christoph Hellwig 0 siblings, 1 reply; 31+ messages in thread From: Michael K. Edwards @ 2007-03-13 17:59 UTC (permalink / raw) To: David Miller; +Cc: alan, 7eggert, dada1, linux-kernel Clearly f_pos atomicity has been handled differently in the not-so-distant past: http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg01628.html And equally clearly the current generic_file_llseek semantics are erroneous for large offsets, and we shouldn't be taking the inode mutex in any case other than SEEK_END: http://marc.theaimsgroup.com/?l=linux-fsdevel&m=100584441922835&w=2 read_write.c is a perfect example of the relative amateurishness of parts of the Linux kernel. It has probably gone through a dozen major refactorings in fifteen years but has never acquired a "theory of operations" section justifying the errors returned in terms of standards and common usage patterns. It doesn't have assertion-style precondition/postcondition checks. It's full of bare constants and open-coded flag tests. It exposes sys_* and vfs_* and generic_file_* without any indication of which sanity checks you're bypassing when you call the inner functions directly. It leaks performance right and left with missing likely() macros and function calls that can't be inlined because there's no interface/implementation split. And then you want to tell me that it costs too much to spinlock around f_pos updates when file->f_count > 1? Give me a break. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 17:59 ` Michael K. Edwards @ 2007-03-13 19:09 ` Christoph Hellwig 2007-03-13 23:40 ` Michael K. Edwards 0 siblings, 1 reply; 31+ messages in thread From: Christoph Hellwig @ 2007-03-13 19:09 UTC (permalink / raw) To: Michael K. Edwards; +Cc: David Miller, alan, 7eggert, dada1, linux-kernel Michael, please stop spreading this utter bullshit _now_. You're so full of half-knowledge that it's not funny anymore, and you try to insult people knowing a few magnitudes more than you left and right. Can you please select a different mailing list for your trolling? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 19:09 ` Christoph Hellwig @ 2007-03-13 23:40 ` Michael K. Edwards 2007-03-14 0:09 ` Michael K. Edwards 0 siblings, 1 reply; 31+ messages in thread From: Michael K. Edwards @ 2007-03-13 23:40 UTC (permalink / raw) To: Christoph Hellwig, Michael K. Edwards, David Miller, alan, 7eggert, dada1, linux-kernel On 3/13/07, Christoph Hellwig <hch@infradead.org> wrote: > Michael, please stop spreading this utter bullshit _now_. You're so > full of half-knowledge that it's not funny anymore, and you try > to insult people knowing a few magniutes more than you left and right. Thank you Christoph for that informative response to my comments. I take it that you consider read_write.c to be code of the highest quality and maintainability. If you have something specific in mind when you write "utter bullshit" and "half-knowledge", I'd love to hear it. Now, for those who still care to respond as if improving the kernel were a goal that you and I can share, a question: When generic_file_llseek needs the inode in order to retrieve the current file size, it goes through f_mapping (the pagecache entry?) rather than through f_path.dentry (the dentry cache?). All other inode retrievals in read_write.c go through f_path.dentry. Why? Or is this a question that can only be asked on linux-fsdevel? Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 23:40 ` Michael K. Edwards @ 2007-03-14 0:09 ` Michael K. Edwards 0 siblings, 0 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-14 0:09 UTC (permalink / raw) To: Christoph Hellwig, Michael K. Edwards, David Miller, alan, 7eggert, dada1, linux-kernel

In case anyone cares, this is a snippet of my work-in-progress read_write.c illustrating how I might handle f_pos. Can anyone point me to data showing whether it's worth avoiding the spinlock when the "struct file" is not shared between threads? (In my world, correctness comes before code-bumming as long as the algorithm scales properly, and there are a fair number of corner cases to think through -- although one might be able to piggy-back on the logic in fget_light.)

Cheers,
- Michael

/*
 * Synchronization of f_pos is not for the purpose of serializing writes
 * to the same file descriptor from multiple threads.  It is solely to
 * protect against corruption of the f_pos field leading to a severe
 * violation of its semantics, such as:
 *  - a user-visible negative value on a file type which POSIX forbids
 *    ever to have a negative offset; or
 *  - an unexpected jump from (say) (2^32 - small) to (2^33 - small),
 *    due to an interrupt between the two 32-bit write instructions
 *    needed to write out an loff_t on some architectures, leading to
 *    a delayed overwrite of half of the f_pos value written by another
 *    thread.  (Applicable to SMP and CONFIG_PREEMPT kernels.)
 *
 * Three tiers of protection on f_pos may be needed in order to trade off
 * between performance and least surprise:
 *
 * 1. All f_pos accesses must go through accessors that protect against
 *    problems with atomic 64-bit writes on some platforms.  These
 *    accessors are only atomic with respect to one another.
 *
 * 2. Those few accesses that cannot handle transient negative values of
 *    f_pos must be protected from a race in some llseek implementations
 *    (including generic_file_llseek).
Correct application code should
 *    never encounter this race, and the syscall use cases that are
 *    vulnerable to it are relatively infrequent.  This is a job for an
 *    rwlock, although the sense is inverted (readers need exclusive
 *    access to a "stalled pipeline", while writers only need to be able
 *    to fix things up after the fact in the event of an exception).
 *
 * 3. Applications that cannot handle transient overshoot on f_pos, under
 *    conditions where several threads are writing to the same open file
 *    concurrently and one of them experiences a short write, can be
 *    protected from themselves by an rwsem around vfs_write(v) calls.
 *    (The same applies to multi-threaded reads, mutatis mutandis.)
 *    When CONFIG_WOMBAT (waste of memory, brain, and time -- thanks,
 *    Bodo!) is enabled, this per-struct-file rwsem is taken as necessary.
 */

#define file_pos_local_acquire(file, flags) \
	spin_lock_irqsave(file->f_pos_lock, flags)
#define file_pos_local_release(file, flags) \
	spin_unlock_irqrestore(file->f_pos_lock, flags)
#define file_pos_excl_acquire(file, flags) \
	do { \
		write_lock_irqsave(file->f_pos_rwlock, flags); \
		spin_lock(file->f_pos_lock); \
	} while (0)
#define file_pos_excl_release(file, flags) \
	do { \
		spin_unlock(file->f_pos_lock); \
		write_unlock_irqrestore(file->f_pos_rwlock, flags); \
	} while (0)
#define file_pos_nonexcl_acquire(file, flags) \
	do { \
		read_lock_irqsave(file->f_pos_rwlock, flags); \
		spin_lock(file->f_pos_lock); \
	} while (0)
#define file_pos_nonexcl_release(file, flags) \
	do { \
		spin_unlock(file->f_pos_lock); \
		read_unlock_irqrestore(file->f_pos_rwlock, flags); \
	} while (0)

/*
 * Accessors for f_pos (the file descriptor "position" for seekable file
 * types, also of interest as a bytes read/written counter on non-seekable
 * file types such as pipes and FIFOs).
The f_pos field of struct file
 * should be accessed exclusively through these functions, so that the
 * changes needed to interlock these accesses atomically are localized to
 * the accessor functions.
 *
 * file_pos_write is defined to return the old file position so that it
 * can be restored by the caller if appropriate.  (Note that it is not
 * necessarily guaranteed that restoring the old position will not clobber
 * a value written by another thread; see below.)  file_pos_adjust is also
 * defined to return the old file position because it is more often needed
 * immediately by the caller; the new position can always be obtained by
 * adding the value passed into the "pos" parameter to file_pos_adjust.
 */

/*
 * Architectures on which an aligned 64-bit read/write is atomic can omit
 * locking in file_pos_read and file_pos_write, but not in file_pos_adjust.
 * Not locking in file_pos_write could lead to a "lost seek" from another
 * thread between the old position (returned by file_pos_write) and the new
 * position; any use of the value returned by file_pos_write must take this
 * into account.
 */
static inline loff_t file_pos_read(const struct file *file)
{
	loff_t oldpos;
	unsigned long flags;

	file_pos_local_acquire(file, flags);
	oldpos = file->f_pos;
	file_pos_local_release(file, flags);
	return oldpos;
}

static inline loff_t file_pos_write(struct file *file, loff_t pos)
{
	loff_t oldpos;
	unsigned long flags;

	file_pos_local_acquire(file, flags);
	oldpos = file->f_pos;
	file->f_pos = pos;
	file_pos_local_release(file, flags);
	return oldpos;
}

/*
 * file_pos_adjust is logically a read-modify-write and could be replaced
 * by an atomic 64-bit read-modify-write operation on an architecture where
 * this is available.  Such an architecture would not need f_pos_lock.
*/ static inline loff_t file_pos_adjust(struct file *file, loff_t offset) { loff_t oldpos; unsigned long flags; file_pos_local_acquire(file, flags); oldpos = file->f_pos; file->f_pos = oldpos + offset; file_pos_local_release(file, flags); return oldpos; } /* * file_pos_raw_write and file_pos_raw_adjust are to be called only while * the caller is already holding f_pos_lock. */ static inline loff_t file_pos_raw_adjust(struct file *file, loff_t offset) { loff_t oldpos; oldpos = file->f_pos; file->f_pos = oldpos + offset; return oldpos; } /* * file_pos_read_safe differs in principle from file_pos_read in that * it should never return a "transient" value (negative due to the race * in llseek or too large due to the overshoot in read/write). Use it * only when the caller needs an accurate position in a multi-threaded * scenario; it effectively forces a pipeline flush. */ static inline loff_t file_pos_read_safe(struct file *file) { loff_t pos; unsigned long flags; file_pos_excl_acquire(file, flags); pos = file->f_pos; file_pos_excl_release(file, flags); return pos; } /* * In the common case (seek to a valid file offset), generic_file_llseek * touches the f_pos member with exactly one accessor (file_pos_write or * file_pos_adjust). It does not perform a check against s_maxbytes, * because POSIX 2004 doesn't require such a check (EINVAL is supposed to * indicate a bad "whence" argument or an attempt to seek to a negative * offset). Filesystems are required to verify the reasonableness of the * "pos" argument to all read/write calls, which is not checked by syscall * wrappers such as sys_pwrite. * * This implementation contains a race condition which could lead to a * transiently negative value of f_pos being visible to another thread * during an llseek with SEEK_CUR to an invalid offset. POSIX requires that * "the file offset shall remain unchanged" on error, but doesn't address * atomicity issues when two threads write/seek concurrently on the same * file descriptor. 
* * This function does not check whether the file is a pipe/FIFO/socket and * therefore "incapable of seeking" according to POSIX! Callers who * require strict POSIX semantics (such as sys_lseek()) must do their own * checks for EBADF, ESPIPE, and EOVERFLOW. This function does, however, * handle all EINVAL cases according to the POSIX semantics of "a regular * file, block special file, or directory". It is therefore not suitable * for hypothetical devices for which a negative file offset may be valid. */ loff_t generic_file_llseek(struct file *file, loff_t offset, int origin) { /* Catch all invalid values of "origin" in one test. */ if (unlikely(((unsigned int)origin) > SEEK_END)) return -EINVAL; if (origin != SEEK_CUR) { /* Seeking to an absolute position is the simple case. */ loff_t oldpos; /* * When generic_file_llseek needs the inode in order to * retrieve the current file size, it goes through f_mapping * (the pagecache entry) rather than through f_path.dentry * (the dentry cache). All other inode retrievals in * read_write.c go through f_path.dentry. Why? */ if (origin == SEEK_END) offset += i_size_read(file->f_mapping->host); if (unlikely(offset < 0)) return -EINVAL; oldpos = file_pos_write(file, offset); if (likely(oldpos != offset)) file->f_version = 0; return offset; } else if (unlikely(offset == 0)) { /* llseek(fd, 0, SEEK_CUR) is asking to read f_pos. Use * file_pos_read_safe to avoid returning a transient value. */ return file_pos_read_safe(file); } else { loff_t oldpos, retval; unsigned long flags; file_pos_nonexcl_acquire(file, flags); oldpos = file_pos_raw_adjust(file, offset); if (unlikely(oldpos + offset < 0)) { /* !!! TRANSIENTLY NEGATIVE f_pos !!! */ /* * Several relative seeks could have raced here. * We want to leave f_pos with a non-negative value, * while obeying POSIX in spirit (no change to f_pos * on EINVAL). 
But we don't want to require an * atomic read-modify-write that's conditional on the * result of the arithmetic -- at least not on * architectures that lack LL/SC. So we do something * reasonable: leave a positive value in f_pos that * was the actual file offset before one of the * relative seeks. */ if (oldpos >= 0) file_pos_raw_write(file, oldpos); retval = -EINVAL; } else { file->f_version = 0; retval = oldpos + offset; } file_pos_nonexcl_release(file, flags); return retval; } /* Control should never reach the end of this function */ } ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 7:25 ` Michael K. Edwards 2007-03-13 7:42 ` David Miller @ 2007-03-13 13:15 ` Alan Cox 2007-03-14 20:09 ` Michael K. Edwards 1 sibling, 1 reply; 31+ messages in thread From: Alan Cox @ 2007-03-13 13:15 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List > But on that note -- do you have any idea how one might get ltrace to > work on a multi-threaded program, or how one might enhance it to > instrument function calls from one shared library to another? Or I don't know a vast amount about ARM ELF user space so no. > better yet, can you advise me on how to induce gdbserver to stream > traces of library/syscall entry/exits for all the threads in a > process? And then how to cram it down into the kernel so I don't take One way to do this is to use kprobes which will do exactly what you want as it builds a kernel module. Doesn't currently run on ARM afaik. > the hit for an MMU context switch every time I hit a syscall or Not easily with gdbstubs as you've got to talk to something to decide how to log the data and proceed. If you stick it kernel side it's a lot of ugly new code and easier to port kprobes over; if you do it remotely as gdbstubs intends it adds latencies and screws all your timings. gdbstubs is also not terribly SMP aware and for low level work it's sometimes easier to have one gdb per processor if you can get your brain around it. Alan ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 13:15 ` Alan Cox @ 2007-03-14 20:09 ` Michael K. Edwards 2007-03-16 16:43 ` Frank Ch. Eigler 2007-03-16 17:25 ` Alan Cox 0 siblings, 2 replies; 31+ messages in thread From: Michael K. Edwards @ 2007-03-14 20:09 UTC (permalink / raw) To: Alan Cox; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List Thanks Alan, this is really helpful -- I'll see if I can figure out kprobes. It is not immediately obvious to me how to use them to trace function calls in userspace, but I'll read up. On 3/13/07, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > > But on that note -- do you have any idea how one might get ltrace to > > work on a multi-threaded program, or how one might enhance it to > > instrument function calls from one shared library to another? Or > > I don't know a vast amount about ARM ELF user space so no. This isn't ARM-specific, although it's certainly an ELF issue. ltrace uses libelf to inspect the program binary and replace calls to shared library functions with breakpoints. When the program hits one of these breakpoints, ltrace inspects the function arguments, puts the function call back in place of the breakpoint, single steps through it, puts the breakpoint back, does some sort of sleight of hand to ensure that it will break again on function return, and continues. When it breaks again on function return, ltrace inspects the return value and then continues. I suppose I should wrap my head around the kwatch stuff and see if there's a way of doing this without intrusive manipulation of the text segment, which seems to be the thing preventing ltrace from being extensible to multi-threaded programs. The ELF handling code in ltrace is sufficiently opaque to me that I doubt I could make it work on function calls inside shared libraries without completely rewriting it -- at which point I'd rather go the extra mile with kprobes, I think. 
> > better yet, can you advise me on how to induce gdbserver to stream > > traces of library/syscall entry/exits for all the threads in a > > process? And then how to cram it down into the kernel so I don't take > > One way to do this is to use kprobes which will do exactly what you want > as it builds a kernel module. Doesn't currently run on ARM afaik. There doesn't seem to be an arch/arm/kernel/kprobes.c, and I'm not sure I have the skill to write one. On the other hand, ltrace manages something similar (through libelf), and I can trace ltrace's ptrace activity with strace, so maybe I can clone that. :-) > > the hit for an MMU context switch every time I hit a syscall or > > Not easily with gdbstubs as you've got to talk to something to decide how > to log the data and proceed. If you stick it kernel side it's a lot of > ugly new code and easier to port kprobes over; if you do it remotely as > gdbstubs intends it adds latencies and screws all your timings. Sure does. I do have gdbserver working remotely but it's not as useful as I hoped because real-world multi-threaded code is so full of latent races that it dies instantly under gdb. Unless, of course, that multi-threaded code was systematically developed using gdb and qemu and other techniques of flushing out racy designs before their implementation is entrenched. Not typical of the embedded world, I'm sorry to say. > gdbstubs is also not terribly SMP aware and for low level work it's > sometimes easier to have one gdb per processor if you can get your brain > around it. That's a trick I don't know. What do you do, fire up the target process, put the whole thing to sleep with a SIGSTOP, and then attach a separate gdb to each thread after they've been migrated and locked down to the destination CPU? Cheers, - Michael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-14 20:09 ` Michael K. Edwards @ 2007-03-16 16:43 ` Frank Ch. Eigler 2007-03-16 17:25 ` Alan Cox 1 sibling, 0 replies; 31+ messages in thread From: Frank Ch. Eigler @ 2007-03-16 16:43 UTC (permalink / raw) To: Michael K. Edwards Cc: Alan Cox, Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List "Michael K. Edwards" <medwards.linux@gmail.com> writes: > Thanks Alan, this is really helpful -- I'll see if I can figure out > kprobes. It is not immediately obvious to me how to use them to > trace function calls in userspace, but I'll read up. kprobes does not work with userspace. We in systemtap land have been pushing for such an extension, and primarily IBM folks have been writing one that would be acceptable for LKML. > [...] There doesn't seem to be an arch/arm/kernel/kprobes.c, and > I'm not sure I have the skill to write one. On the other hand, > ltrace manages something similar (through libelf), and I can trace > ltrace's ptrace activity with strace, so maybe I can clone that. > :-) Efforts are underway to port kprobes to ARM. Both these efforts may be tracked on the public systemtap mailing list. - FChE ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-14 20:09 ` Michael K. Edwards @ 2007-03-16 17:25 ` Alan Cox 1 sibling, 0 replies; 31+ messages in thread From: Alan Cox @ 2007-03-16 17:25 UTC (permalink / raw) To: Michael K. Edwards; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List > > gdbstubs is also not terribly SMP aware and for low level work it's > > sometimes easier to have one gdb per processor if you can get your brain > > around it. > > That's a trick I don't know. What do you do, fire up the target > process, put the whole thing to sleep with a SIGSTOP, and then attach > a separate gdb to each thread after they've been migrated and locked > down to the destination CPU? gdbstubs/kgdbstubs is kernel side so you boot the two cores and each core halts in the debugger on a breakpoint trap. One debug stub uses one serial port (or one UDP port), the other uses a different one. Two gdbs and you can stop/play with each processor independently. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: sys_write() racy for multi-threaded append? 2007-03-13 2:24 ` Alan Cox 2007-03-13 7:25 ` Michael K. Edwards @ 2007-03-13 14:00 ` David M. Lloyd 1 sibling, 0 replies; 31+ messages in thread From: David M. Lloyd @ 2007-03-13 14:00 UTC (permalink / raw) To: Alan Cox Cc: Michael K. Edwards, Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List On Tue, 2007-03-13 at 02:24 +0000, Alan Cox wrote: > > purports to handle short writes but has never been exercised is > > arguably worse than code that simply bombs on short write. So if I > > can't shim in an induce-short-writes-randomly-on-purpose mechanism > > during development, I don't want short writes in production, period. > > Easy enough to do and gcov plus dejagnu or similar tools will let you > coverage analyse the resulting test set and replay it. You don't even need special tools: just change your code that says: foo = write(fd, mybuf, mycount); to say (for example): foo = write(fd, mybuf, mycount / randomly_either_1_or_2); Why would this need kernel support? The average developer doesn't really need to verify that the *kernel* works. They just need to test their own code paths - and in this case, they can see that foo is less than mycount (sometimes). The code paths don't care that it was not the kernel that caused it. - DML ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2007-03-16 16:45 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-03-08 23:08 sys_write() racy for multi-threaded append? Michael K. Edwards 2007-03-08 23:43 ` Eric Dumazet 2007-03-08 23:57 ` Michael K. Edwards 2007-03-09 0:15 ` Eric Dumazet 2007-03-09 0:45 ` Michael K. Edwards 2007-03-09 1:34 ` Benjamin LaHaise 2007-03-09 12:19 ` Michael K. Edwards 2007-03-09 13:44 ` Eric Dumazet 2007-03-09 14:10 ` Alan Cox 2007-03-09 14:59 ` Benjamin LaHaise 2007-03-10 6:43 ` Michael K. Edwards 2007-03-09 5:53 ` Eric Dumazet 2007-03-09 11:52 ` Michael K. Edwards 2007-03-09 0:43 ` Alan Cox [not found] <7WzUo-1zl-21@gated-at.bofh.it> [not found] ` <7WAx2-2pg-21@gated-at.bofh.it> [not found] ` <7WAGF-2Bx-9@gated-at.bofh.it> [not found] ` <7WB07-3g5-33@gated-at.bofh.it> [not found] ` <7WBt7-3SZ-23@gated-at.bofh.it> 2007-03-12 7:53 ` Bodo Eggert 2007-03-12 16:26 ` Michael K. Edwards 2007-03-12 18:48 ` Bodo Eggert 2007-03-13 0:46 ` Michael K. Edwards 2007-03-13 2:24 ` Alan Cox 2007-03-13 7:25 ` Michael K. Edwards 2007-03-13 7:42 ` David Miller 2007-03-13 16:24 ` Michael K. Edwards 2007-03-13 17:59 ` Michael K. Edwards 2007-03-13 19:09 ` Christoph Hellwig 2007-03-13 23:40 ` Michael K. Edwards 2007-03-14 0:09 ` Michael K. Edwards 2007-03-13 13:15 ` Alan Cox 2007-03-14 20:09 ` Michael K. Edwards 2007-03-16 16:43 ` Frank Ch. Eigler 2007-03-16 17:25 ` Alan Cox 2007-03-13 14:00 ` David M. Lloyd