linux-kernel.vger.kernel.org archive mirror
* Re: sys_write() racy for multi-threaded append?
       [not found]       ` <7WBt7-3SZ-23@gated-at.bofh.it>
@ 2007-03-12  7:53         ` Bodo Eggert
  2007-03-12 16:26           ` Michael K. Edwards
  0 siblings, 1 reply; 31+ messages in thread
From: Bodo Eggert @ 2007-03-12  7:53 UTC (permalink / raw)
  To: Michael K. Edwards, Eric Dumazet, Linux Kernel Mailing List

Michael K. Edwards <medwards.linux@gmail.com> wrote:

> On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Absolutely not. We don't want to slow down the kernel 'just in case a fool
>> might want to do crazy things'.
> 
> Actually, I think it would make the kernel (negligibly) faster to bump
> f_pos before the vfs_write() call.

This is a security risk.

----------------
other process:
unlink(secret_file)

Thread 1:
write(fd, large)
(interrupted)

Thread 2:
fseek(fd, -n, relative)
read(fd, buf)
----------------

BTW: The best thing you can do to a program where two threads race for
writing one fd is to let it crash and burn in the most spectacular way
allowed without affecting the rest of the system, unless it happens to
be a pipe and the number of bytes written is less than PIPE_BUF.
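Concretely, the ingredient the scenario needs is two descriptors sharing one struct file, and hence one f_pos. From user space the same sharing can be produced with dup(); a minimal sketch (file name and helper name are arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

/* dup() gives a second descriptor on the same file description, i.e.
 * the same struct file and the same f_pos -- exactly the sharing two
 * threads get on a common fd.  A write through one descriptor moves
 * the offset the other one sees. */
long shared_fpos_demo(const char *path)
{
    int fd1 = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int fd2 = dup(fd1);                  /* same file description */

    write(fd1, "0123456789", 10);        /* advances the shared f_pos */
    long pos = lseek(fd2, 0, SEEK_CUR);  /* observed through fd2 */

    close(fd1);
    close(fd2);
    return pos;                          /* 10: the offset is shared */
}
```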
-- 
The secret of the universe is #@*%! NO CARRIER 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-12  7:53         ` sys_write() racy for multi-threaded append? Bodo Eggert
@ 2007-03-12 16:26           ` Michael K. Edwards
  2007-03-12 18:48             ` Bodo Eggert
  0 siblings, 1 reply; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-12 16:26 UTC (permalink / raw)
  To: 7eggert; +Cc: Eric Dumazet, Linux Kernel Mailing List

On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote:
> Michael K. Edwards <medwards.linux@gmail.com> wrote:
> > Actually, I think it would make the kernel (negligibly) faster to bump
> > f_pos before the vfs_write() call.
>
> This is a security risk.
>
> ----------------
> other process:
> unlink(secret_file)
>
> Thread 1:
> write(fd, large)
> (interrupted)
>
> Thread 2:
> fseek(fd, -n, relative)
> read(fd, buf)
> ----------------

I don't entirely follow this.  Which process is supposed to be secure
here, and what information is supposed to be secret from whom?  But in
any case, are you aware that f_pos is clamped to the inode's
max_bytes, not to the current file size?  Thread 2 can seek/read past
the position at which Thread 1's write began, whether f_pos is bumped
before or after the vfs_write.

> BTW: The best thing you can do to a program where two threads race for
> writing one fd is to let it crash and burn in the most spectacular way
> allowed without affecting the rest of the system, unless it happens to
> be a pipe and the number of bytes written is less than PIPE_BUF.

That's fine when you're doing integration testing, and should probably be
the default during development.  But if the race is first exposed in
the field, or if the developer is trying to concentrate on a different
problem, "spectacular crash and burn" may do more harm than good.
It's easy enough to refactor the f_pos handling in the kernel so that
it all goes through three or four inline accessor functions, at which
point you can choose your trade-off between speed and idiot-proofness
-- at _kernel_ compile time, or (given future hardware that supports
standardized optionally-atomic-based-on-runtime-flag operations) per
process at run-time.

Frankly, I think that unless application programmers poke at some sort
of magic "I promise to handle short writes correctly" bit, write()
should always return either the full number of bytes requested or an
error code.  If they do poke at that bit, the (development) kernel
should deliberately split writes a few percent of the time, just to
exercise the short-write code paths.  And in order to find out where
that magic bit is, they should have to read the kernel code or ask on
LKML (and get the "standard lecture").

Really very IEEE754-like, actually.  (Harp, harp.)

Cheers,
- Michael


* Re: sys_write() racy for multi-threaded append?
  2007-03-12 16:26           ` Michael K. Edwards
@ 2007-03-12 18:48             ` Bodo Eggert
  2007-03-13  0:46               ` Michael K. Edwards
  0 siblings, 1 reply; 31+ messages in thread
From: Bodo Eggert @ 2007-03-12 18:48 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: 7eggert, Eric Dumazet, Linux Kernel Mailing List

On Mon, 12 Mar 2007, Michael K. Edwards wrote:
> On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote:
> > Michael K. Edwards <medwards.linux@gmail.com> wrote:

> > > Actually, I think it would make the kernel (negligibly) faster to bump
> > > f_pos before the vfs_write() call.
> >
> > This is a security risk.

<snip>

> I don't entirely follow this.

I got it wrong, and I didn't CC myself so I couldn't correct it in time.

If it was a security risk, it would exist right now, too, because of 
other access patterns.

> > BTW: The best thing you can do to a program where two threads race for
> > writing one fd is to let it crash and burn in the most spectacular way
> > allowed without affecting the rest of the system, unless it happens to
> > be a pipe and the number of bytes written is less than PIPE_BUF.
> 
> That's fine when you're doing integration testing, and should probably be
> the default during development.  But if the race is first exposed in
> the field, or if the developer is trying to concentrate on a different
> problem, "spectacular crash and burn" may do more harm than good.
> It's easy enough to refactor the f_pos handling in the kernel so that
> it all goes through three or four inline accessor functions, at which
> point you can choose your trade-off between speed and idiot-proofness
> -- at _kernel_ compile time, or (given future hardware that supports
> standardized optionally-atomic-based-on-runtime-flag operations) per
> process at run-time.

CONFIG_WOMBAT

Waste memory, brain and time in order to guarantee an atomic write which is 
neither guaranteed by the standard nor expected by any sane programmer, 
just in case some idiot tries to write to one file from multiple 
processes.

Warning: Programs expecting this behaviour are buggy and non-portable.

> Frankly, I think that unless application programmers poke at some sort
> of magic "I promise to handle short writes correctly" bit, write()
> should always return either the full number of bytes requested or an
> error code.

If you assume that you won't have short writes, your programs may fail on
e.g. Solaris. There may be reasons for Linux to adopt the same semantics at 
some point in the future, you never know.

If you assume you *may* have short writes, you have no problem.

>  If they do poke at that bit, the (development) kernel
> should deliberately split writes a few percent of the time, just to
> exercise the short-write code paths.  And in order to find out where
> that magic bit is, they should have to read the kernel code or ask on
> LKML (and get the "standard lecture").

I think it may be well-hidden in menuconfig.

> Really very IEEE754-like, actually.  (Harp, harp.)

-EYOULOSTME
-- 
"Try to look unimportant; they may be low on ammo."
-Infantry Journal


* Re: sys_write() racy for multi-threaded append?
  2007-03-12 18:48             ` Bodo Eggert
@ 2007-03-13  0:46               ` Michael K. Edwards
  2007-03-13  2:24                 ` Alan Cox
  0 siblings, 1 reply; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-13  0:46 UTC (permalink / raw)
  To: Bodo Eggert; +Cc: Eric Dumazet, Linux Kernel Mailing List

On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote:
> On Mon, 12 Mar 2007, Michael K. Edwards wrote:
> > That's fine when you're doing integration testing, and should probably be
> > the default during development.  But if the race is first exposed in
> > the field, or if the developer is trying to concentrate on a different
> > problem, "spectacular crash and burn" may do more harm than good.
> > It's easy enough to refactor the f_pos handling in the kernel so that
> > it all goes through three or four inline accessor functions, at which
> > point you can choose your trade-off between speed and idiot-proofness
> > -- at _kernel_ compile time, or (given future hardware that supports
> > standardized optionally-atomic-based-on-runtime-flag operations) per
> > process at run-time.
>
> CONFIG_WOMBAT
>
> Waste memory, brain and time in order to guarantee an atomic write which is
> neither guaranteed by the standard nor expected by any sane programmer,
> just in case some idiot tries to write to one file from multiple
> processes.
>
> Warning: Programs expecting this behaviour are buggy and non-portable.

OK, I laughed out loud at this.  But I think you're missing my point,
which is that there's a time to be hard-core about code quality and
there's a time to be hard-core about _product_ quality.  Face it, all
products containing software more or less suck.  This is because most
programmers write crap code most of the time.  The only way to cope
with this, outside the confines of the European defense industry and
other niches insulated from economic reality, is to make the
production environment gentler on _application_ code than the
development environment is.  Hence CONFIG_WOMBAT.  (I like that name.
I'm going to use it in my patch, with your permission.  :-)

Writing to a file from multiple processes is not usually the problem.
Writing to a common "struct file" from multiple threads is.  99.999%
of the time it will work, because you're only writing as far as VFS
cache and then bumping f_pos, and your threads are probably on the
same processor anyway.  0.001% of the time the second thread will see
a stale f_pos and clobber the first write.  This is true even on file
types that can never return a short write.  If you remember to open
with O_APPEND so the pos argument to vfs_write is silently ignored, or
if the implementation underlying vfs_write effectively ignores the pos
argument irrespective of flags, you're OK.  If the pos argument isn't
ignored, or if you ever look at the result of a relative seek on any
fd that maps to that struct file, you're screwed.

(Note to the alert reader:  yes, this means shell scripts should
always use >> rather than > when routing stdout and/or stderr to a
file.  You're just as vulnerable to interleaving due to stdio
buffering issues as you are when stdout and stderr are sent to the tty,
and short writes may still be a problem if you are so foolish as to
use a filesystem that generates them on anything short of a
catastrophic error, but at least you get O_APPEND and sane behavior on
ftruncate().)
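A minimal user-space sketch of why the O_APPEND route is safe: the kernel appends at EOF regardless of the current offset, so a stale or clobbered cursor cannot cause one write to land on top of another (file name and helper name are arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

/* With O_APPEND the kernel ignores the current offset for writes and
 * appends at EOF, which is why >> is safe where > is not. */
long append_demo(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);

    write(fd, "AAAA", 4);
    lseek(fd, 0, SEEK_SET);   /* try to rewind... */
    write(fd, "BBBB", 4);     /* ...but O_APPEND still writes at EOF */

    long size = lseek(fd, 0, SEEK_END);
    close(fd);
    return size;              /* 8: nothing was overwritten */
}
```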

> > Frankly, I think that unless application programmers poke at some sort
> > of magic "I promise to handle short writes correctly" bit, write()
> > should always return either the full number of bytes requested or an
> > error code.
>
> If you assume that you won't have short writes, your programs may fail on
> e.g. Solaris. There may be reasons for Linux to adopt the same semantics at
> some point in the future, you never know.

So what?  My products are shipping _now_.  Future kernels are
guaranteed to break them anyway because sysfs is a moving target.
Solaris is so not in the game for my kind of embedded work, it's not
even funny.  If POSIX mandates stupid shit, and application
programmers don't read that part of the manual anyway (and don't code
on that assumption in practice), to hell with POSIX.  On many file
descriptors, short writes simply can't happen -- and code that
purports to handle short writes but has never been exercised is
arguably worse than code that simply bombs on short write.  So if I
can't shim in an induce-short-writes-randomly-on-purpose mechanism
during development, I don't want short writes in production, period.

In my world, GNU/Linux is not a crappy imitation Solaris that you get
to pay out the wazoo for to Red Hat (and get no documentation and
lousy tech support that doesn't even cover your hardware).  It's a
full-source-code platform on which you can engineer robust industrial
and consumer products, because you can control the freeze and release
schedule component-by-component, and you can point fix anything in the
system at any time.  If, that is, you understand that the source code
is not the software, and that you can't retrofit stability and
security overnight onto code that was written with no thought of
anything but performance.

> If you assume you *may* have short writes, you have no problem.

Sure -- until the one code path in a hundred that handles the "short
write" case incorrectly gets traversed in production, after having
gone untested in a development environment that used a different
filesystem that never happened to trigger it.

> >  If they do poke at that bit, the (development) kernel
> > should deliberately split writes a few percent of the time, just to
> > exercise the short-write code paths.  And in order to find out where
> > that magic bit is, they should have to read the kernel code or ask on
> > LKML (and get the "standard lecture").
>
> I think it may be well-hidden in menuconfig.
>
> > Really very IEEE754-like, actually.  (Harp, harp.)
>
> -EYOULOSTME

I have been harping in other threads on the desirability of an AIO
model that resembles IEEE754 floating point in some key respects that
would make it practical to implement AEIOUs (asynchronously executed
I/O units) in future hardware.  That means revamping the calling
conventions and exception semantics to reflect what latitude the
implementors of pipelined AIO coprocessors are likely to need, not the
designed-by-committee, implemented-by-accretion POSIX heritage.  My
advocacy in mere human language has so far resulted in -ENOPATCH, so
I'm having the alternate joy and misery of rewriting fs/read_write.c
and all its friends.  I'm learning quite a lot about Linux internals
in the process, too -- at multiple levels, coding and design and
community dynamics.

Cheers,
- Michael


* Re: sys_write() racy for multi-threaded append?
  2007-03-13  0:46               ` Michael K. Edwards
@ 2007-03-13  2:24                 ` Alan Cox
  2007-03-13  7:25                   ` Michael K. Edwards
  2007-03-13 14:00                   ` David M. Lloyd
  0 siblings, 2 replies; 31+ messages in thread
From: Alan Cox @ 2007-03-13  2:24 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

> Writing to a file from multiple processes is not usually the problem.
> Writing to a common "struct file" from multiple threads is.

Not normally, because POSIX sensibly invented pread/pwrite. They forgot
preadv/pwritev, but they did the basics, and that was the end of the problem.
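The reason pread/pwrite end the problem: they take an explicit offset and never touch the shared f_pos, so concurrent threads need no cursor coordination. A minimal sketch (file name and helper name are arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

/* pread reads at an explicit offset and leaves the shared f_pos
 * untouched, so threads on the same fd cannot race on the cursor. */
long pread_leaves_fpos_alone(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    char c;

    write(fd, "hello world", 11);   /* f_pos is now 11 */
    pread(fd, &c, 1, 6);            /* reads 'w' without moving f_pos */

    long pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return pos;                     /* still 11 */
}
```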

> So what?  My products are shipping _now_.  

That doesn't inspire confidence.

> even funny.  If POSIX mandates stupid shit, and application
> programmers don't read that part of the manual anyway (and don't code
> on that assumption in practice), to hell with POSIX.  On many file

That's funny, you were talking about quality a moment ago.

> descriptors, short writes simply can't happen -- and code that

There is almost no descriptor this is true for. Any file I/O can and will
end up short on disk full or resource limit exceeded or quota exceeded or
NFS server exploded or ...

And on the device side about the only thing with the vaguest guarantees
is pipe().

> purports to handle short writes but has never been exercised is
> arguably worse than code that simply bombs on short write.  So if I
> can't shim in an induce-short-writes-randomly-on-purpose mechanism
> during development, I don't want short writes in production, period.

Easy enough to do and gcov plus dejagnu or similar tools will let you
coverage analyse the resulting test set and replay it.

> Sure -- until the one code path in a hundred that handles the "short
> write" case incorrectly gets traversed in production, after having
> gone untested in a development environment that used a different
> filesystem that never happened to trigger it.

Competent QA and testing people test all the returns in the manual as
well as all the returns they can find in the code. See ptrace(2) if you
don't want to do a lot of relinking and strace for some useful worked
examples of syscall hooking.

Alan


* Re: sys_write() racy for multi-threaded append?
  2007-03-13  2:24                 ` Alan Cox
@ 2007-03-13  7:25                   ` Michael K. Edwards
  2007-03-13  7:42                     ` David Miller
  2007-03-13 13:15                     ` Alan Cox
  2007-03-13 14:00                   ` David M. Lloyd
  1 sibling, 2 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-13  7:25 UTC (permalink / raw)
  To: Alan Cox; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

On 3/12/07, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > Writing to a file from multiple processes is not usually the problem.
> > Writing to a common "struct file" from multiple threads is.
>
> Not normally, because POSIX sensibly invented pread/pwrite. They forgot
> preadv/pwritev, but they did the basics, and that was the end of the problem.

pread/pwrite address a minuscule fraction of lseek+read(v)/write(v)
use cases -- a fraction that someone cared about strongly enough to
get into X/Open CAE Spec Issue 5 Version 2 (1997), from which it
propagated into UNIX98 and thence into POSIX.1-2001.  The fact that no
one has bothered to implement preadv/pwritev in the decade since
pread/pwrite entered the Single UNIX standard reflects the rarity with
which they appear in general code.  Life is too short to spend it
rewriting application code that uses readv/writev systematically,
especially when that code is going to ship inside a widget whose
kernel you control.

> > So what?  My products are shipping _now_.
>
> That doesn't inspire confidence.

Oh, please.  Like _your_ employer is the poster child for code
quality.  The cheap shot is also irrelevant to the point that I was
making, which is that sometimes portability simply doesn't matter and
the right thing to do is to firm up the semantics of the filesystem
primitives from underneath.

> > even funny.  If POSIX mandates stupid shit, and application
> > programmers don't read that part of the manual anyway (and don't code
> > on that assumption in practice), to hell with POSIX.  On many file
>
> That's funny, you were talking about quality a moment ago.

Quality means the devices you ship now keep working in the field, and
the probable cost of later rework if the requirements change does not
exceed the opportunity cost of over-engineering up front.  Economy
gets a look-in too, and says that it's pointless to delay shipment and
bloat the application coding for cases that can't happen.  If POSIX
says that any and all writes (except small pipe/FIFO writes, whatever)
can return a short byte count -- but you know damn well you're writing
to a device driver that never, ever writes short, and if it did you'd
miss a timing budget recovering from it anyway -- to hell with POSIX.
And if you want to build a test jig for this code that uses pipes or
dummy files in place of the device driver, that test jig should never,
ever write short either.

> > descriptors, short writes simply can't happen -- and code that
>
> There is almost no descriptor this is true for. Any file I/O can and will
> end up short on disk full or resource limit exceeded or quota exceeded or
> NFS server exploded or ...

Not on a properly engineered widget, it won't.  And if it does, and
the application isn't coded to cope in some way totally different from
an infinite retry loop, then you might as well signal the exception
condition using whatever mechanism is appropriate to the API
(-EWHATEVER, SIGCRISIS, or block until some other process makes room).
And in any case files on disk are the least interesting kind of file
descriptor in an embedded scenario -- devices and pipes and pollfds
and netlink sockets are far more frequent read/write targets.

> And on the device side about the only thing with the vaguest guarantees
> is pipe().

Guaranteed by the standard, sure.  Guaranteed by the implementation,
as long as you write in the size blocks that the device is expecting?
Lots of devices -- ALSA's OSS PCM emulation, most AF_LOCAL and
AF_NETLINK sockets, almost any "character" device with a
record-structured format.  A short write to any of these almost
certainly means the framing is screwed and you need to close and
reopen the device.  Not all of these are exclusively O_APPEND
situations, and there's no reason on earth not to thread-safe the
f_pos handling so that an application and filesystem/driver can agree
on useful lseek() semantics.

> > purports to handle short writes but has never been exercised is
> > arguably worse than code that simply bombs on short write.  So if I
> > can't shim in an induce-short-writes-randomly-on-purpose mechanism
> > during development, I don't want short writes in production, period.
>
> Easy enough to do and gcov plus dejagnu or similar tools will let you
> coverage analyse the resulting test set and replay it.

Here we agree.  Except that I've rarely seen embedded application code
that wouldn't explode in my face if I tried it.  Databases yes, and
the better class of mail and web servers, and relatively mature
scripting languages and bytecode interpreters; but the vast majority
of working programmers in these latter days do not exercise this level
of discipline.

> > Sure -- until the one code path in a hundred that handles the "short
> > write" case incorrectly gets traversed in production, after having
> > gone untested in a development environment that used a different
> > filesystem that never happened to trigger it.
>
> Competent QA and testing people test all the returns in the manual as
> well as all the returns they can find in the code. See ptrace(2) if you
> don't want to do a lot of relinking and strace for some useful worked
> examples of syscall hooking.

Even in the "enterprise" space, most of the QA and testing people I
have dealt with couldn't hook a syscall if their children were
starving and the fish were only biting on syscalls.  The embedded
situation is even worse.  ltrace didn't work on ARM for years and
hardly anyone _noticed_, let alone did anything about it.  (I posted a
fix to ltrace-devel a month ago, but evidently the few fish left in
that river don't bite on hooked syscalls.)  strace maintenance doesn't
seem too healthy either, judging by what I had to do to it in order to
get it to recognize ALSA ioctls and the quasi-syscalls involved in ARM
TLS.

But on that note -- do you have any idea how one might get ltrace to
work on a multi-threaded program, or how one might enhance it to
instrument function calls from one shared library to another?  Or
better yet, can you advise me on how to induce gdbserver to stream
traces of library/syscall entry/exits for all the threads in a
process?  And then how to cram it down into the kernel so I don't take
the hit for an MMU context switch every time I hit a syscall or
breakpoint in the process under test?  That would be a really useful
tool for failure analysis in embedded Linux, doubly so on multi-core
chips, especially if it could be made minimally intrusive on the
CPU(s) where the application is running.

Cheers,
- Michael


* Re: sys_write() racy for multi-threaded append?
  2007-03-13  7:25                   ` Michael K. Edwards
@ 2007-03-13  7:42                     ` David Miller
  2007-03-13 16:24                       ` Michael K. Edwards
  2007-03-13 13:15                     ` Alan Cox
  1 sibling, 1 reply; 31+ messages in thread
From: David Miller @ 2007-03-13  7:42 UTC (permalink / raw)
  To: medwards.linux; +Cc: alan, 7eggert, dada1, linux-kernel

From: "Michael K. Edwards" <medwards.linux@gmail.com>
Date: Mon, 12 Mar 2007 23:25:48 -0800

> Quality means the devices you ship now keep working in the field, and
> the probable cost of later rework if the requirements change does not
> exceed the opportunity cost of over-engineering up front.  Economy
> gets a look-in too, and says that it's pointless to delay shipment and
> bloat the application coding for cases that can't happen.  If POSIX
> says that any and all writes (except small pipe/FIFO writes, whatever)
> can return a short byte count -- but you know damn well you're writing
> to a device driver that never, ever writes short, and if it did you'd
> miss a timing budget recovering from it anyway -- to hell with POSIX.

You're not even safe over standard output, simply run the program over
ssh and you suddenly have socket semantics to deal with.

In the early days the fun game to play was to run programs over rsh to
see in what amusing way they would explode.  ssh has replaced rsh in
this game, but the bugs have largely stayed the same.

Even early versions of tar used to explode on TCP half-closes and
whatnot.

In short, if you don't handle short writes, you're writing a program
for something other than unix.

We're not changing write() to interlock with other parallel callers or
messing with the f_pos semantics in such cases, that's stupid, please
cope, kthx.


* Re: sys_write() racy for multi-threaded append?
  2007-03-13  7:25                   ` Michael K. Edwards
  2007-03-13  7:42                     ` David Miller
@ 2007-03-13 13:15                     ` Alan Cox
  2007-03-14 20:09                       ` Michael K. Edwards
  1 sibling, 1 reply; 31+ messages in thread
From: Alan Cox @ 2007-03-13 13:15 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

> But on that note -- do you have any idea how one might get ltrace to
> work on a multi-threaded program, or how one might enhance it to
> instrument function calls from one shared library to another?  Or

I don't know a vast amount about ARM ELF user space so no.

> better yet, can you advise me on how to induce gdbserver to stream
> traces of library/syscall entry/exits for all the threads in a
> process?  And then how to cram it down into the kernel so I don't take

One way to do this is to use kprobes which will do exactly what you want
as it builds a kernel module. Doesn't currently run on ARM afaik.

> the hit for an MMU context switch every time I hit a syscall or

Not easily with gdbstubs, as you've got to talk to something to decide how
to log the data and proceed. If you stick it kernel-side it's a lot of
ugly new code and easier to port kprobes over; if you do it remotely as
gdbstubs intends, it adds latencies and screws all your timings.

gdbstubs is also not terribly SMP-aware, and for low-level work it's
sometimes easier to have one gdb per processor if you can get your brain
around it.

Alan


* Re: sys_write() racy for multi-threaded append?
  2007-03-13  2:24                 ` Alan Cox
  2007-03-13  7:25                   ` Michael K. Edwards
@ 2007-03-13 14:00                   ` David M. Lloyd
  1 sibling, 0 replies; 31+ messages in thread
From: David M. Lloyd @ 2007-03-13 14:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Michael K. Edwards, Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

On Tue, 2007-03-13 at 02:24 +0000, Alan Cox wrote:
> > purports to handle short writes but has never been exercised is
> > arguably worse than code that simply bombs on short write.  So if I
> > can't shim in an induce-short-writes-randomly-on-purpose mechanism
> > during development, I don't want short writes in production, period.
> 
> Easy enough to do and gcov plus dejagnu or similar tools will let you
> coverage analyse the resulting test set and replay it.

You don't even need special tools: just change your code that says:

	foo = write(fd, mybuf, mycount);

to say (for example):

	foo = write(fd, mybuf, mycount / randomly_either_1_or_2);

Why would this need kernel support?  The average developer doesn't
really need to verify that the *kernel* works.  They just need to test
their own code paths - and in this case, they can see that foo is less
than mycount (sometimes).  The code paths don't care that it was not the
kernel that caused it.

- DML



* Re: sys_write() racy for multi-threaded append?
  2007-03-13  7:42                     ` David Miller
@ 2007-03-13 16:24                       ` Michael K. Edwards
  2007-03-13 17:59                         ` Michael K. Edwards
  0 siblings, 1 reply; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-13 16:24 UTC (permalink / raw)
  To: David Miller; +Cc: alan, 7eggert, dada1, linux-kernel

On 3/13/07, David Miller <davem@davemloft.net> wrote:
> You're not even safe over standard output, simply run the program over
> ssh and you suddenly have socket semantics to deal with.

I'm intimately familiar with this one.  Easily worked around by piping
the output through cat or tee.  Not that one should ever write code
for a *nix box that can't cope with stdout being a socket or tty; but
sometimes the quickest way to glue existing code into a pipeline is to
pass /dev/stdout in place of a filename, and there's no real value in
reworking legacy code to handle short writes when you can just drop in
cat.

> In short, if you don't handle short writes, you're writing a program
> for something other than unix.

Right in one.  You're writing a program for a controlled environment,
whether it's a Linux-only API (netlink sockets) or a Linux-only
embedded product.  And there's no need for that controlled environment
to gratuitously reproduce the overly vague semantics of the *nix zoo.

> We're not changing write() to interlock with other parallel callers or
> messing with the f_pos semantics in such cases, that's stupid, please
> cope, kthx.

This is not what I am proposing, and certainly not what I'm
implementing.  Right now f_pos handling is gratuitously thread-unsafe
even in the common, all writes completed normally case.  writev()
opens the largest possible window to f_pos corruption by delaying the
f_pos update until after the vfs_writev() completes.  That's the main
thing I want to see fixed.

_If_ it proves practical to wrap accessor functions around f_pos
accesses, and _if_ accesses to f_pos from outside read_write.c can be
made robust against transient overshoot, then there is no harm in
tightening up f_pos semantics so that successful pipelined writes to
the same file don't collide so easily.  If read-modify-write on f_pos
were atomic, it would be easy to guarantee that it is in a sane state
any time a syscall is not actually outstanding on that struct file.
There's even a potential performance gain from streamlining the
pattern of L1 cache usage in the common case, although I expect it's
negligible.
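The atomic read-modify-write in question can be sketched with C11 atomics standing in for the kernel's atomic64 operations (the accessor name is arbitrary, not a kernel API):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Each writer claims its byte range with one atomic read-modify-write,
 * so a concurrent writer can never observe, or store back, a stale
 * position.  Returns the old position, i.e. where this write begins. */
int64_t fpos_fetch_add(_Atomic int64_t *f_pos, int64_t nbytes)
{
    return atomic_fetch_add(f_pos, nbytes);
}
```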

Making read-modify-write atomic on f_pos is of course not free on all
platforms.  But once you have the accessors, you can decide at kernel
compile time whether or not to interlock them with spinlocks and
memory barriers or some such.  Most platforms probably needn't bother;
there are much bigger time bombs lurking elsewhere in the kernel.  x86
is a bit dicey, because there's no way to atomically update a 64-bit
loff_t in memory, and so in principle you could get a race on carry
across the 32-bit boundary.  (This file instantaneously jumped from
4GB to 8GB in size?  What the hell?  How long would it take you to
spot a preemption between writing out the upper and lower halves of a
64-bit integer?)
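The 4GB-to-8GB jump is exactly a torn store across the 32-bit word boundary; a sketch of the value a reader preempted between the two halves would observe, assuming the high word happens to be stored first (helper name arbitrary):

```c
#include <stdint.h>

/* On a 32-bit CPU a 64-bit f_pos update compiles to two stores.  A
 * reader running between them sees a value that was never written:
 * here, the new high half combined with the stale low half. */
uint64_t torn_fpos(uint64_t old_pos, uint64_t new_pos)
{
    uint32_t hi = (uint32_t)(new_pos >> 32);  /* already stored */
    uint32_t lo = (uint32_t)old_pos;          /* not yet stored */
    return ((uint64_t)hi << 32) | lo;
}
```

Crossing the 4GB boundary (old position just under 4GB, new position 4GB) yields an observed offset just under 8GB, which is the jump described above.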

In any case, I have certainly gotten the -ENOPATCH message by now.  It
must be nice not to have to care whether things work in the field if
you can point at something stupid that an application programmer did.
I don't always have that luxury.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-13 16:24                       ` Michael K. Edwards
@ 2007-03-13 17:59                         ` Michael K. Edwards
  2007-03-13 19:09                           ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-13 17:59 UTC (permalink / raw)
  To: David Miller; +Cc: alan, 7eggert, dada1, linux-kernel

Clearly f_pos atomicity has been handled differently in the not-so-distant past:

http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg01628.html

And equally clearly the current generic_file_llseek semantics are
erroneous for large offsets, and we shouldn't be taking the inode
mutex in any case other than SEEK_END:

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=100584441922835&w=2

read_write.c is a perfect example of the relative amateurishness of
parts of the Linux kernel.  It has probably gone through a dozen major
refactorings in fifteen years but has never acquired a "theory of
operations" section justifying the errors returned in terms of
standards and common usage patterns.  It doesn't have assertion-style
precondition/postcondition checks.  It's full of bare constants and
open-coded flag tests.  It exposes sys_* and vfs_* and generic_file_*
without any indication of which sanity checks you're bypassing when
you call the inner functions directly.  It leaks performance right and
left with missing likely() macros and function calls that can't be
inlined because there's no interface/implementation split.  And then
you want to tell me that it costs too much to spinlock around f_pos
updates when file->f_count > 1?  Give me a break.


* Re: sys_write() racy for multi-threaded append?
  2007-03-13 17:59                         ` Michael K. Edwards
@ 2007-03-13 19:09                           ` Christoph Hellwig
  2007-03-13 23:40                             ` Michael K. Edwards
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2007-03-13 19:09 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: David Miller, alan, 7eggert, dada1, linux-kernel


Michael, please stop spreading this utter bullshit _now_.  You're so
full of half-knowledge that it's not funny anymore, and you try
to insult people knowing a few magnitudes more than you left and right.

Can you please select a different mailinglist for your trolling?



* Re: sys_write() racy for multi-threaded append?
  2007-03-13 19:09                           ` Christoph Hellwig
@ 2007-03-13 23:40                             ` Michael K. Edwards
  2007-03-14  0:09                               ` Michael K. Edwards
  0 siblings, 1 reply; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-13 23:40 UTC (permalink / raw)
  To: Christoph Hellwig, Michael K. Edwards, David Miller, alan,
	7eggert, dada1, linux-kernel

On 3/13/07, Christoph Hellwig <hch@infradead.org> wrote:
> Michael, please stop spreading this utter bullshit _now_.  You're so
> full of half-knowledge that it's not funny anymore, and you try
> to insult people knowing a few magnitudes more than you left and right.

Thank you Christoph for that informative response to my comments.  I
take it that you consider read_write.c to be code of the highest
quality and maintainability.  If you have something specific in mind
when you write "utter bullshit" and "half-knowledge", I'd love to hear
it.

Now, for those who still care to respond as if improving the kernel
were a goal that you and I can share, a question:  When
generic_file_llseek needs the inode in order to retrieve the current
file size, it goes through f_mapping (the pagecache entry?) rather
than through f_path.dentry (the dentry cache?).  All other inode
retrievals in read_write.c go through f_path.dentry.  Why?  Or is this
a question that can only be asked on linux-fsdevel?

Cheers,
- Michael


* Re: sys_write() racy for multi-threaded append?
  2007-03-13 23:40                             ` Michael K. Edwards
@ 2007-03-14  0:09                               ` Michael K. Edwards
  0 siblings, 0 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-14  0:09 UTC (permalink / raw)
  To: Christoph Hellwig, Michael K. Edwards, David Miller, alan,
	7eggert, dada1, linux-kernel

In case anyone cares, this is a snippet of my work-in-progress
read_write.c illustrating how I might handle f_pos.  Can anyone point
me to data showing whether it's worth avoiding the spinlock when the
"struct file" is not shared between threads?  (In my world,
correctness comes before code-bumming as long as the algorithm scales
properly, and there are a fair number of corner cases to think through
-- although one might be able to piggy-back on the logic in
fget_light.)

Cheers,
- Michael

/*
 *  Synchronization of f_pos is not for the purpose of serializing writes
 *  to the same file descriptor from multiple threads.  It is solely to
 *  protect against corruption of the f_pos field leading to a severe
 *  violation of its semantics, such as:
 *      - a user-visible negative value on a file type which POSIX forbids
 *        ever to have a negative offset; or
 *      - an unexpected jump from (say) (2^32 - small) to (2^33 - small),
 *        due to an interrupt between the two 32-bit write instructions
 *        needed to write out an loff_t on some architectures, leading to
 *        a delayed overwrite of half of the f_pos value written by another
 *        thread.  (Applicable to SMP and CONFIG_PREEMPT kernels.)
 *
 *  Three tiers of protection on f_pos may be needed in order to trade off
 *  between performance and least surprise:
 *
 *    1. All f_pos accesses must go through accessors that protect against
 *       problems with atomic 64-bit writes on some platforms.  These
 *       accessors are only atomic with respect to one another.
 *
 *    2. Those few accesses that cannot handle transient negative values of
 *       f_pos must be protected from a race in some llseek implementations
 *       (including generic_file_llseek).  Correct application code should
 *       never encounter this race, and the syscall use cases that are
 *       vulnerable to it are relatively infrequent.  This is a job for an
 *       rwlock, although the sense is inverted (readers need exclusive
 *       access to a "stalled pipeline", while writers only need to be able
 *       to fix things up after the fact in the event of an exception).
 *
 *    3. Applications that cannot handle transient overshoot on f_pos, under
 *       conditions where several threads are writing to the same open file
 *       concurrently and one of them experiences a short write, can be
 *       protected from themselves by an rwsem around vfs_write(v) calls.
 *       (The same applies to multi-threaded reads, mutatis mutandis.)
 *       When CONFIG_WOMBAT (waste of memory, brain, and time -- thanks,
 *       Bodo!) is enabled, this per-struct-file rwsem is taken as necessary.
 */

#define file_pos_local_acquire(file, flags) \
        spin_lock_irqsave(&(file)->f_pos_lock, flags)

#define file_pos_local_release(file, flags) \
        spin_unlock_irqrestore(&(file)->f_pos_lock, flags)

#define file_pos_excl_acquire(file, flags) \
        do {                                                            \
                write_lock_irqsave(&(file)->f_pos_rwlock, flags);       \
                spin_lock(&(file)->f_pos_lock);                         \
        } while (0)

#define file_pos_excl_release(file, flags) \
        do {                                                            \
                spin_unlock(&(file)->f_pos_lock);                       \
                write_unlock_irqrestore(&(file)->f_pos_rwlock, flags);  \
        } while (0)

#define file_pos_nonexcl_acquire(file, flags) \
        do {                                                            \
                read_lock_irqsave(&(file)->f_pos_rwlock, flags);        \
                spin_lock(&(file)->f_pos_lock);                         \
        } while (0)

#define file_pos_nonexcl_release(file, flags) \
        do {                                                            \
                spin_unlock(&(file)->f_pos_lock);                       \
                read_unlock_irqrestore(&(file)->f_pos_rwlock, flags);   \
        } while (0)

/*
 *  Accessors for f_pos (the file descriptor "position" for seekable file
 *  types, also of interest as a bytes read/written counter on non-seekable
 *  file types such as pipes and FIFOs).  The f_pos field of struct file
 *  should be accessed exclusively through these functions, so that the
 *  changes needed to interlock these accesses atomically are localized to
 *  the accessor functions.
 *
 *  file_pos_write is defined to return the old file position so that it
 *  can be restored by the caller if appropriate.  (Note that it is not
 *  necessarily guaranteed that restoring the old position will not clobber
 *  a value written by another thread; see below.)  file_pos_adjust is also
 *  defined to return the old file position because it is more often needed
 *  immediately by the caller; the new position can always be obtained by
 *  adding the value passed into the "pos" parameter to file_pos_adjust.
 */

/*
 *  Architectures on which an aligned 64-bit read/write is atomic can omit
 *  locking in file_pos_read and file_pos_write, but not in file_pos_adjust.
 *  Not locking in file_pos_write could lead to a "lost seek" from another
 *  thread between the old position (returned by file_pos_write) and the new
 *  position; any use of the value returned by file_pos_write must take this
 *  into account.
 */
static inline loff_t file_pos_read(struct file *file)
{
        loff_t oldpos;
        unsigned long flags;

        file_pos_local_acquire(file, flags);
        oldpos = file->f_pos;
        file_pos_local_release(file, flags);
        return oldpos;
}

static inline loff_t file_pos_write(struct file *file, loff_t pos)
{
        loff_t oldpos;
        unsigned long flags;

        file_pos_local_acquire(file, flags);
        oldpos = file->f_pos;
        file->f_pos = pos;
        file_pos_local_release(file, flags);
        return oldpos;
}

/*
 *  file_pos_adjust is logically a read-modify-write and could be replaced
 *  by an atomic 64-bit read-modify-write operation on an architecture where
 *  this is available.  Such an architecture would not need f_pos_lock.
 */
static inline loff_t file_pos_adjust(struct file *file, loff_t offset)
{
        loff_t oldpos;
        unsigned long flags;

        file_pos_local_acquire(file, flags);
        oldpos = file->f_pos;
        file->f_pos = oldpos + offset;
        file_pos_local_release(file, flags);
        return oldpos;
}

/*
 *  file_pos_raw_write and file_pos_raw_adjust are to be called only while
 *  the caller is already holding f_pos_lock.
 */
static inline void file_pos_raw_write(struct file *file, loff_t pos)
{
        file->f_pos = pos;
}

static inline loff_t file_pos_raw_adjust(struct file *file, loff_t offset)
{
        loff_t oldpos;
        oldpos = file->f_pos;
        file->f_pos = oldpos + offset;
        return oldpos;
}

/*
 *  file_pos_read_safe differs in principle from file_pos_read in that
 *  it should never return a "transient" value (negative due to the race
 *  in llseek or too large due to the overshoot in read/write).  Use it
 *  only when the caller needs an accurate position in a multi-threaded
 *  scenario; it effectively forces a pipeline flush.
 */
static inline loff_t file_pos_read_safe(struct file *file)
{
        loff_t pos;
        unsigned long flags;

        file_pos_excl_acquire(file, flags);
        pos = file->f_pos;
        file_pos_excl_release(file, flags);
        return pos;
}

/*
 *  In the common case (seek to a valid file offset), generic_file_llseek
 *  touches the f_pos member with exactly one accessor (file_pos_write or
 *  file_pos_adjust).  It does not perform a check against s_maxbytes,
 *  because POSIX 2004 doesn't require such a check (EINVAL is supposed to
 *  indicate a bad "whence" argument or an attempt to seek to a negative
 *  offset).  Filesystems are required to verify the reasonableness of the
 *  "pos" argument to all read/write calls, which is not checked by syscall
 *  wrappers such as sys_pwrite.
 *
 *  This implementation contains a race condition which could lead to a
 *  transiently negative value of f_pos being visible to another thread
 *  during an llseek with SEEK_CUR to an invalid offset.  POSIX requires that
 *  "the file offset shall remain unchanged" on error, but doesn't address
 *  atomicity issues when two threads write/seek concurrently on the same
 *  file descriptor.
 *
 *  This function does not check whether the file is a pipe/FIFO/socket and
 *  therefore "incapable of seeking" according to POSIX!  Callers who
 *  require strict POSIX semantics (such as sys_lseek()) must do their own
 *  checks for EBADF, ESPIPE, and EOVERFLOW.  This function does, however,
 *  handle all EINVAL cases according to the POSIX semantics of "a regular
 *  file, block special file, or directory".  It is therefore not suitable
 *  for hypothetical devices for which a negative file offset may be valid.
 */

loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
{
        /* Catch all invalid values of "origin" in one test. */
        if (unlikely(((unsigned int)origin) > SEEK_END))
                return -EINVAL;

        if (origin != SEEK_CUR) {
                /* Seeking to an absolute position is the simple case. */
                loff_t oldpos;
                /*
                 * When generic_file_llseek needs the inode in order to
                 * retrieve the current file size, it goes through f_mapping
                 * (the pagecache entry) rather than through f_path.dentry
                 * (the dentry cache).  All other inode retrievals in
                 * read_write.c go through f_path.dentry.  Why?
                 */
                if (origin == SEEK_END)
                        offset += i_size_read(file->f_mapping->host);
                if (unlikely(offset < 0))
                        return -EINVAL;
                oldpos = file_pos_write(file, offset);
                if (likely(oldpos != offset))
                        file->f_version = 0;
                return offset;
        } else if (unlikely(offset == 0)) {
                /* llseek(fd, 0, SEEK_CUR) is asking to read f_pos.  Use
                 * file_pos_read_safe to avoid returning a transient value. */
                return file_pos_read_safe(file);
        } else {
                loff_t oldpos, retval;
                unsigned long flags;

                file_pos_nonexcl_acquire(file, flags);
                oldpos = file_pos_raw_adjust(file, offset);
                if (unlikely(oldpos + offset < 0)) {
                        /* !!! TRANSIENTLY NEGATIVE f_pos !!! */
                        /*
                         * Several relative seeks could have raced here.
                         * We want to leave f_pos with a non-negative value,
                         * while obeying POSIX in spirit (no change to f_pos
                         * on EINVAL).  But we don't want to require an
                         * atomic read-modify-write that's conditional on the
                         * result of the arithmetic -- at least not on
                         * architectures that lack LL/SC.  So we do something
                         * reasonable: leave a positive value in f_pos that
                         * was the actual file offset before one of the
                         * relative seeks.
                         */
                        if (oldpos >= 0)
                                file_pos_raw_write(file, oldpos);
                        retval = -EINVAL;
                } else {
                        file->f_version = 0;
                        retval = oldpos + offset;
                }
                file_pos_nonexcl_release(file, flags);
                return retval;
        }
        /* Control should never reach the end of this function */
}


* Re: sys_write() racy for multi-threaded append?
  2007-03-13 13:15                     ` Alan Cox
@ 2007-03-14 20:09                       ` Michael K. Edwards
  2007-03-16 16:43                         ` Frank Ch. Eigler
  2007-03-16 17:25                         ` Alan Cox
  0 siblings, 2 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-14 20:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

Thanks Alan, this is really helpful -- I'll see if I can figure out
kprobes.  It is not immediately obvious to me how to use them to trace
function calls in userspace, but I'll read up.

On 3/13/07, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > But on that note -- do you have any idea how one might get ltrace to
> > work on a multi-threaded program, or how one might enhance it to
> > instrument function calls from one shared library to another?  Or
>
> I don't know a vast amount about ARM ELF user space so no.

This isn't ARM-specific, although it's certainly an ELF issue.  ltrace
uses libelf to inspect the program binary and replace calls to shared
library functions with breakpoints.  When the program hits one of
these breakpoints, ltrace inspects the function arguments, puts the
function call back in place of the breakpoint, single steps through
it, puts the breakpoint back, does some sort of sleight of hand to
ensure that it will break again on function return, and continues.
When it breaks again on function return, ltrace inspects the return
value and then continues.

I suppose I should wrap my head around the kwatch stuff and see if
there's a way of doing this without intrusive manipulation of the text
segment, which seems to be the thing preventing ltrace from being
extensible to multi-threaded programs.  The ELF handling code in
ltrace is sufficiently opaque to me that I doubt I could make it work
on function calls inside shared libraries without completely rewriting
it -- at which point I'd rather go the extra mile with kprobes, I
think.

> > better yet, can you advise me on how to induce gdbserver to stream
> > traces of library/syscall entry/exits for all the threads in a
> > process?  And then how to cram it down into the kernel so I don't take
>
> One way to do this is to use kprobes which will do exactly what you want
> as it builds a kernel module. Doesn't currently run on ARM afaik.

There doesn't seem to be an arch/arm/kernel/kprobes.c, and I'm not
sure I have the skill to write one.  On the other hand, ltrace manages
something similar (through libelf), and I can trace ltrace's ptrace
activity with strace, so maybe I can clone that.  :-)

> > the hit for an MMU context switch every time I hit a syscall or
>
> Not easily with gdbstubs as you've got to talk to something to decide how
> to log the data and proceed. If you stick it kernel side its a lot of
> ugly new code and easier to port kprobes over, if you do it remotely as
> gdbstubs intends it adds latencies and screws all your timings.

Sure does.  I do have gdbserver working remotely but it's not as
useful as I hoped because real-world multi-threaded code is so full of
latent races that it dies instantly under gdb.  Unless, of course,
that multi-threaded code was systematically developed using gdb and
qemu and other techniques of flushing out racy designs before their
implementation is entrenched.  Not typical of the embedded world, I'm
sorry to say.

> gdbstubs is also not terribly SMP aware and for low level work it's
> sometimes easier to have one gdb per processor if you can get your brain
> around it.

That's a trick I don't know.  What do you do, fire up the target
process, put the whole thing to sleep with a SIGSTOP, and then attach
a separate gdb to each thread after they've been migrated and locked
down to the destination CPU?

Cheers,
- Michael


* Re: sys_write() racy for multi-threaded append?
  2007-03-14 20:09                       ` Michael K. Edwards
@ 2007-03-16 16:43                         ` Frank Ch. Eigler
  2007-03-16 17:25                         ` Alan Cox
  1 sibling, 0 replies; 31+ messages in thread
From: Frank Ch. Eigler @ 2007-03-16 16:43 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Alan Cox, Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

"Michael K. Edwards" <medwards.linux@gmail.com> writes:

> Thanks Alan, this is really helpful -- I'll see if I can figure out
> kprobes.  It is not immediately obvious to me how to use them to
> trace function calls in userspace, but I'll read up.

kprobes does not work with userspace.  We in systemtap land have been
pushing for such an extension, and primarily IBM folks have been
writing one that would be acceptable for LKML.

> [...]  There doesn't seem to be an arch/arm/kernel/kprobes.c, and
> I'm not sure I have the skill to write one.  On the other hand,
> ltrace manages something similar (through libelf), and I can trace
> ltrace's ptrace activity with strace, so maybe I can clone that.
> :-)

Efforts are underway to port kprobes to ARM.

Both these efforts may be tracked on the public systemtap mailing list.

- FChE


* Re: sys_write() racy for multi-threaded append?
  2007-03-14 20:09                       ` Michael K. Edwards
  2007-03-16 16:43                         ` Frank Ch. Eigler
@ 2007-03-16 17:25                         ` Alan Cox
  1 sibling, 0 replies; 31+ messages in thread
From: Alan Cox @ 2007-03-16 17:25 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Bodo Eggert, Eric Dumazet, Linux Kernel Mailing List

> > gdbstubs is also not terribly SMP aware and for low level work it's
> > sometimes easier to have one gdb per processor if you can get your brain
> > around it.
> 
> That's a trick I don't know.  What do you do, fire up the target
> process, put the whole thing to sleep with a SIGSTOP, and then attach
> a separate gdb to each thread after they've been migrated and locked
> down to the destination CPU?

gdbstubs/kgdbstubs is kernel side so you boot the two cores and each core
halts in the debugger on a breakpoint trap. One debug stub uses one
serial port (or one UDP port), the other uses a different one.

Two gdbs and you can stop/play with each processor independently.


* Re: sys_write() racy for multi-threaded append?
  2007-03-09 14:59             ` Benjamin LaHaise
@ 2007-03-10  6:43               ` Michael K. Edwards
  0 siblings, 0 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-10  6:43 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, Linux Kernel Mailing List

I apologize for throwing around words like "stupid".  Whether or not
the current semantics can be improved, that's not a constructive way
to characterize them.  I'm sorry.

As three people have ably pointed out :-), the particular case of a
pipe/FIFO isn't seekable and doesn't need the f_pos member anyway
(it's effectively always O_APPEND).  That's what I get for checking
against standards documents at 3AM.  Of course, this has nothing to do
with the point that led me to comment on pipes/FIFOs (which was that
there exist file types that never return 0<ret<nbytes).  And it was in
the context of a very explicit aside that f_pos is not _interesting_
on a pipe/FIFO, except as an indicator of total bytes written.  You
could only peek at this with an (admittedly non-portable) llseek(fd,
0, SEEK_CUR) anyway -- which you would only do for diagnostic
purposes.  But diagnosis of odd corner cases (rarely in my code,
usually in other people's) is what I do day in and day out, so for me
it would be worth having.

In any case, you're all right that the standard doesn't require you to
do anything useful with f_pos on a pipe/FIFO.  But you're permitted to
make it useful if you want to:

<1003.1 lseek()>
The behavior of lseek() on devices which are incapable of seeking is
implementation-defined. The value of the file offset associated with
such a device is undefined.
</1003.1>

Tracking f_pos accurately when writes from multiple threads hit the
same fd (pipe or not) isn't portable, but I recall situations where it
would have been useful.  And if f_pos has to be kept at all in the
uncontended case, it costs you little or nothing to do it in a
thread-safe manner -- as long as you don't overconstrain the semantics
such that you forbid the transient overshoot associated with a short
write.  In fact, unless there's something I've missed, increasing
f_pos before entering vfs_write() happens to be _faster_ than the
current code for common load patterns, both single- and multi-threaded
(although getting the full benefit in the multi-threaded case will
take some fiddling with f_count placement).

I say it costs "little or nothing" only because altering an loff_t
atomically is not free.  But even on x86, with its inability to
atomically modify any 64-bit entity in memory, an uncontended spinlock
on a cacheline already in L1 is so cheap that making the f_pos changes
atomic will (I think) be lost in the noise.

In any case, rewriting read_write.c is proving interesting.  I'll let
you all know if anything comes of it.  In the meantime, thanks for
your (really quite friendly under the circumstances) comments.

Cheers,
- Michael


* Re: sys_write() racy for multi-threaded append?
  2007-03-09 12:19           ` Michael K. Edwards
  2007-03-09 13:44             ` Eric Dumazet
  2007-03-09 14:10             ` Alan Cox
@ 2007-03-09 14:59             ` Benjamin LaHaise
  2007-03-10  6:43               ` Michael K. Edwards
  2 siblings, 1 reply; 31+ messages in thread
From: Benjamin LaHaise @ 2007-03-09 14:59 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Eric Dumazet, Linux Kernel Mailing List

On Fri, Mar 09, 2007 at 04:19:55AM -0800, Michael K. Edwards wrote:
> On 3/8/07, Benjamin LaHaise <bcrl@kvack.org> wrote:
> >Any number of things can cause a short write to occur, and rewinding the
> >file position after the fact is just as bad.  A sane app has to either
> >serialise the writes itself or use a thread safe API like pwrite().
> 
> Not on a pipe/FIFO.  Short writes there are flat out verboten by
> 1003.1 unless O_NONBLOCK is set.  (Not that f_pos is interesting on a
> pipe except as a "bytes sent" indicator  -- and in the multi-threaded
> scenario, if you do the speculative update that I'm suggesting, you
> can't 100% trust it unless you ensure that you are not in
> mid-read/write in some other thread at the moment you sample f_pos.
> But that doesn't make it useless.)

Writes to a pipe/FIFO are atomic so long as they fit within PIPE_BUF, while
f_pos on a pipe is undefined -- what exactly is the issue here?
The semantics you're assuming are not defined by POSIX.  Heck, even looking 
at a man page for one of the *BSDs states "Some devices are incapable of 
seeking.  The value of the pointer associated with such a device is 
undefined."  What part of undefined is problematic?

> As to what a "sane app" has to do: it's just not that unusual to write
> application code that treats a short read/write as a catastrophic
> error, especially when the fd is of a type that is known never to
> produce a short read/write unless something is drastically wrong.  For
> instance, I bomb on short write in audio applications where the driver
> is known to block until enough bytes have been read/written, period.
> When switching from reading a stream of audio frames from thread A to
> reading them from thread B, I may be willing to omit app
> serialization, because I can tolerate an imperfect hand-off in which
> thread A steals one last frame after thread B has started reading --
> as long as the fd doesn't get screwed up.  There is no reason for the
> generic sys_read code to leave a race open in which the same frame is
> read by both threads and a hardware buffer overrun results later.

I hope I don't have to run any of your software.  Short writes can and do 
happen because of a variety of reasons: signals, memory allocation failures, 
quota being exceeded....  These are all error conditions the kernel has to 
provide well defined semantics for, as well behaved applications will try 
to handle them gracefully.

> In short, I'm not proposing that the kernel perfectly serialize
> concurrent reads and writes to arbitrary fd types.  I'm proposing that
> it not do something blatantly stupid and easily avoided in generic
> code that makes it impossible for any fd type to guarantee that, after
> 10 successful pipelined 100-byte reads or writes, f_pos will have
> advanced by 1000.

The semantics you're looking for are defined for regular files with 
O_APPEND.  Anything else is asking for synchronization that other 
applications do not require and do not desire.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.


* Re: sys_write() racy for multi-threaded append?
  2007-03-09 12:19           ` Michael K. Edwards
  2007-03-09 13:44             ` Eric Dumazet
@ 2007-03-09 14:10             ` Alan Cox
  2007-03-09 14:59             ` Benjamin LaHaise
  2 siblings, 0 replies; 31+ messages in thread
From: Alan Cox @ 2007-03-09 14:10 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Benjamin LaHaise, Eric Dumazet, Linux Kernel Mailing List

> 1003.1 unless O_NONBLOCK is set.  (Not that f_pos is interesting on a
> pipe except as a "bytes sent" indicator  -- and in the multi-threaded

f_pos is undefined on a FIFO or similar object.

> As to what a "sane app" has to do: it's just not that unusual to write
> application code that treats a short read/write as a catastrophic
> error, especially when the fd is of a type that is known never to
> produce a short read/write unless something is drastically wrong.  For

If you are working in a strictly POSIX environment, a signal can
interrupt almost any I/O with a short write, even disk I/O. In the sane
world, the file I/O cases don't do this.

> as long as the fd doesn't get screwed up.  There is no reason for the
> generic sys_read code to leave a race open in which the same frame is
> read by both threads and a hardware buffer overrun results later.

Audio devices are not seekable anyway.

> concurrent reads and writes to arbitrary fd types.  I'm proposing that
> it not do something blatantly stupid and easily avoided in generic
> code that makes it impossible for any fd type to guarantee that, after
> 10 successful pipelined 100-byte reads or writes, f_pos will have
> advanced by 1000.

You might want to read up on the Unix design philosophy. Things like
record based I/O are user space to avoid kernel complexity and also so
that the overhead of these things is paid only by those who need them
(it's kind of RISC for OS design).

Alan


* Re: sys_write() racy for multi-threaded append?
  2007-03-09 12:19           ` Michael K. Edwards
@ 2007-03-09 13:44             ` Eric Dumazet
  2007-03-09 14:10             ` Alan Cox
  2007-03-09 14:59             ` Benjamin LaHaise
  2 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2007-03-09 13:44 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Benjamin LaHaise, Linux Kernel Mailing List

On Friday 09 March 2007 13:19, Michael K. Edwards wrote:
> On 3/8/07, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Any number of things can cause a short write to occur, and rewinding the
> > file position after the fact is just as bad.  A sane app has to either
> > serialise the writes itself or use a thread safe API like pwrite().
>
> Not on a pipe/FIFO.  Short writes there are flat out verboten by
> 1003.1 unless O_NONBLOCK is set.  (Not that f_pos is interesting on a
> pipe except as a "bytes sent" indicator  -- and in the multi-threaded
> scenario, if you do the speculative update that I'm suggesting, you
> can't 100% trust it unless you ensure that you are not in
> mid-read/write in some other thread at the moment you sample f_pos.
> But that doesn't make it useless.)
Hello Michael

When was the last time you checked the standards? Please read them again, and
stop misinforming people.

http://www.opengroup.org/onlinepubs/007908775/xsh/write.html

	"On a file not capable of seeking, writing always takes place starting at the
	 current position. The value of a file offset associated with such a device
	 is undefined."

A pipe/FIFO is not capable of seeking.

I let you make the conclusion of these two points.

A conformant kernel is free not to touch f_pos for files not capable of
seeking (pipes, sockets, ...), or to put any value in it.

The current code does that not because of lazy programmers, but because it's
generic, and adding special cases (tests + conditional branches) just slows
down the code and makes it larger.

>
> As to what a "sane app" has to do: it's just not that unusual to write
> application code that treats a short read/write as a catastrophic
> error, especially when the fd is of a type that is known never to
> produce a short read/write unless something is drastically wrong.  For
> instance, I bomb on short write in audio applications where the driver
> is known to block until enough bytes have been read/written, period.
> When switching from reading a stream of audio frames from thread A to
> reading them from thread B, I may be willing to omit app
> serialization, because I can tolerate an imperfect hand-off in which
> thread A steals one last frame after thread B has started reading --
> as long as the fd doesn't get screwed up.  There is no reason for the
> generic sys_read code to leave a race open in which the same frame is
> read by both threads and a hardware buffer overrun results later.

Don't assume your app is sane while the kernel is not. It's not very fair:

Show us the source code so that we can agree with you or disagree.

Also, I've seen some Unixes (namely IBM's AIX) that could return a partial
write even on a regular file on a regular file system. An easy way to trigger
this was to launch a debugger/syscall tracer on the live process while it was
doing a big write(). Most 'sane apps' either ignored the partial return or
just threw an exception.

Even on 'cleaner Unixes', a write() near the ulimit -f may return a partial 
count on a regular file.

>
> In short, I'm not proposing that the kernel perfectly serialize
> concurrent reads and writes to arbitrary fd types.  I'm proposing that
> it not do something blatantly stupid and easily avoided in generic
> code that makes it impossible for any fd type to guarantee that, after
> 10 successful pipelined 100-byte reads or writes, f_pos will have
> advanced by 1000.
>

Before saying the current Linux code is "blatantly stupid and easily avoided",
just post your patches so that we can check them, and eventually say:

Oh yes, Michael was right and {we|they} were "stupid" all these years.

Thank you

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-09  1:34         ` Benjamin LaHaise
@ 2007-03-09 12:19           ` Michael K. Edwards
  2007-03-09 13:44             ` Eric Dumazet
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-09 12:19 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, Linux Kernel Mailing List

On 3/8/07, Benjamin LaHaise <bcrl@kvack.org> wrote:
> Any number of things can cause a short write to occur, and rewinding the
> file position after the fact is just as bad.  A sane app has to either
> serialise the writes itself or use a thread safe API like pwrite().

Not on a pipe/FIFO.  Short writes there are flat out verboten by
1003.1 unless O_NONBLOCK is set.  (Not that f_pos is interesting on a
pipe except as a "bytes sent" indicator  -- and in the multi-threaded
scenario, if you do the speculative update that I'm suggesting, you
can't 100% trust it unless you ensure that you are not in
mid-read/write in some other thread at the moment you sample f_pos.
But that doesn't make it useless.)

As to what a "sane app" has to do: it's just not that unusual to write
application code that treats a short read/write as a catastrophic
error, especially when the fd is of a type that is known never to
produce a short read/write unless something is drastically wrong.  For
instance, I bomb on short write in audio applications where the driver
is known to block until enough bytes have been read/written, period.
When switching from reading a stream of audio frames from thread A to
reading them from thread B, I may be willing to omit app
serialization, because I can tolerate an imperfect hand-off in which
thread A steals one last frame after thread B has started reading --
as long as the fd doesn't get screwed up.  There is no reason for the
generic sys_read code to leave a race open in which the same frame is
read by both threads and a hardware buffer overrun results later.

In short, I'm not proposing that the kernel perfectly serialize
concurrent reads and writes to arbitrary fd types.  I'm proposing that
it not do something blatantly stupid and easily avoided in generic
code that makes it impossible for any fd type to guarantee that, after
10 successful pipelined 100-byte reads or writes, f_pos will have
advanced by 1000.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-09  5:53         ` Eric Dumazet
@ 2007-03-09 11:52           ` Michael K. Edwards
  0 siblings, 0 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-09 11:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List

On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Don't even try; you *cannot* do that without breaking the standards, or
> without a performance drop.
>
> The only safe way would be to lock the file during the whole read()/write()
> syscall, and we don't want this (it would be more expensive than the current
> code). Don't forget 'file' may be a socket/tty/whatever, not a regular file.

I'm not trying to provide full serialization of concurrent
multi-threaded read()/write() in all exceptional scenarios.  I am
trying to think through the semantics of pipelined I/O operations on
struct file.  In the _absence_ of an exception, something sane should
happen -- and when you start at f_pos == 1000 and write 100 bytes each
from two threads (successfully), leaving f_pos at 1100 is not sane.

> The standards say:
>
> If an error occurs, the file pointer remains unchanged.

From 1003.1, 2004 edition:

<standard>
This volume of IEEE Std 1003.1-2001 does not specify the value of the
file offset after an error is returned; there are too many cases. For
programming errors, such as [EBADF], the concept is meaningless since
no file is involved. For errors that are detected immediately, such as
[EAGAIN], clearly the pointer should not change. After an interrupt or
hardware error, however, an updated value would be very useful and is
the behavior of many implementations.

This volume of IEEE Std 1003.1-2001 does not specify behavior of
concurrent writes to a file from multiple processes. Applications
should use some form of concurrency control.
</standard>

The effect on f_pos of an error during concurrent writes is therefore
doubly unconstrained.  In the absence of concurrent writes, it is
quite harmless for f_pos to have transiently contained, at some point
during the execution of write(), an overestimate of the file position
after the write().  In the presence of concurrent writes (let us say
two 100-byte writes to a file whose f_pos starts at 1000), it is
conceivable that the second write will succeed at f_pos == 1100 but
the first will be short (let us say only 50 bytes are written),
leaving f_pos at 1150 and no bytes written in the range 1050 to 1099.
That would suck -- but the standard does not oblige you to avoid it
unless the destination is a pipe or FIFO with O_NONBLOCK clear, in
which case partial writes are flat out verboten.

> You cannot know for sure how many bytes will be written, since write() can
> return a count that is different from buflen.

Of course (except in the pipe/FIFO case).  But if it does, and you're
writing concurrently to the fd, you're not obliged to do anything sane
at all.  If you're not writing concurrently, the fact that you
overshot and then fixed it up after vfs_write() returned is _totally_
invisible.  f_pos is local to the struct file.

> So you cannot update fpos before calling vfs_write()

You can speculatively update it, and in the common case you don't have
to touch it again.  That's a win.
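To make the proposal concrete, the speculative update would look roughly like this (illustrative pseudocode only: f_pos is not actually an atomic type in the kernel, and no such patch exists in this thread):

```c
/* Pseudocode sketch of the speculative f_pos update being proposed. */
loff_t start = atomic64_fetch_add(count, &file->f_pos);	/* reserve range */
loff_t pos = start;
ssize_t ret = vfs_write(file, buf, count, &pos);
if (ret != (ssize_t)count) {
	/* Uncommon path: short write or error.  Give back the part of
	 * the reserved range that was never actually written. */
	atomic64_sub((s64)count - (ret > 0 ? ret : 0), &file->f_pos);
}
```

In the common (full-write) case the struct file is never touched again after
the initial fetch-add, which is where the claimed cache saving would come from.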

> About your L1 'miss', don't forget that multi-threaded apps are going to
> atomic_dec_and_test(&file->f_count) anyway when fput() is done at the end of
> the syscall. And you were concerned about multi-threaded apps, didn't you?

That does indeed interfere with the optimization for multi-threaded
apps.  Which doesn't mean it isn't worth having for single-threaded
apps.  And if we get to the point that that atomic_dec_and_test is the
only thing left (in the common case) that touches the struct file
after a VFS operation completes, then we can evaluate whether f_count
ought to be split out of the struct file and put somewhere else.  In
fact, if I understand the calls inside vfs_write() correctly, f_count
is (usually?) the only member of struct file that is written to during
a call to sys_pwrite64(); so moving it out of struct file and into
some place where it has to be kept cache-coherent anyway would also
improve the performance on SMP of distributed pwrite() calls to a
process-global fd.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-09  0:45       ` Michael K. Edwards
  2007-03-09  1:34         ` Benjamin LaHaise
@ 2007-03-09  5:53         ` Eric Dumazet
  2007-03-09 11:52           ` Michael K. Edwards
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-09  5:53 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Linux Kernel Mailing List

Michael K. Edwards wrote:
> On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Absolutely not. We don't want to slow down the kernel 'just in case a fool
>> might want to do crazy things'
> 
> Actually, I think it would make the kernel (negligibly) faster to bump
> f_pos before the vfs_write() call.  Unless fget_light sets fput_needed
> or the write doesn't complete cleanly, you won't have to touch the
> file table entry again after vfs_write() returns.  You can adjust
> vfs_write to grab f_dentry out of the file before going into
> do_sync_write.  do_sync_write is done with the struct file before it
> goes into the aio_write() loop.  Result: you probably save at least an
> L1 cache miss, unless the aio_write loop is so frugal with L1 cache
> that it doesn't manage to evict the struct file.
> 
> Patch to follow.

Don't even try; you *cannot* do that without breaking the standards, or
without a performance drop.

The only safe way would be to lock the file during the whole read()/write()
syscall, and we don't want this (it would be more expensive than the current
code). Don't forget 'file' may be a socket/tty/whatever, not a regular file.

The standards say:

If an error occurs, the file pointer remains unchanged.

You cannot know for sure how many bytes will be written, since write() can
return a count that is different from buflen.

So you cannot update fpos before calling vfs_write().

About your L1 'miss', don't forget that multi-threaded apps are going to
atomic_dec_and_test(&file->f_count) anyway when fput() is done at the end of
the syscall. And you were concerned about multi-threaded apps, didn't you?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-09  0:45       ` Michael K. Edwards
@ 2007-03-09  1:34         ` Benjamin LaHaise
  2007-03-09 12:19           ` Michael K. Edwards
  2007-03-09  5:53         ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Benjamin LaHaise @ 2007-03-09  1:34 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Eric Dumazet, Linux Kernel Mailing List

> Actually, I think it would make the kernel (negligibly) faster to bump
> f_pos before the vfs_write() call.  Unless fget_light sets fput_needed
> or the write doesn't complete cleanly, you won't have to touch the
> file table entry again after vfs_write() returns.  You can adjust

Any number of things can cause a short write to occur, and rewinding the 
file position after the fact is just as bad.  A sane app has to either 
serialise the writes itself or use a thread safe API like pwrite().
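As a concrete illustration of the pwrite() route, here is a minimal user-space sketch; the helper and the caller-supplied offsets are invented for the example, not taken from this thread:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a buffer at a caller-chosen offset.  Because the offset is passed
 * explicitly, pwrite() never reads or updates the shared file position,
 * so concurrent callers cannot interleave through f_pos.  Loops on EINTR
 * and short writes, which POSIX permits even for regular files. */
static int write_at(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}
```

Each thread computes its own offset (for example, thread_index * record_size),
so no serialisation on the fd is needed at all.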

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-09  0:15     ` Eric Dumazet
@ 2007-03-09  0:45       ` Michael K. Edwards
  2007-03-09  1:34         ` Benjamin LaHaise
  2007-03-09  5:53         ` Eric Dumazet
  0 siblings, 2 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-09  0:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List

On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Absolutely not. We don't want to slow down the kernel 'just in case a fool
> might want to do crazy things'

Actually, I think it would make the kernel (negligibly) faster to bump
f_pos before the vfs_write() call.  Unless fget_light sets fput_needed
or the write doesn't complete cleanly, you won't have to touch the
file table entry again after vfs_write() returns.  You can adjust
vfs_write to grab f_dentry out of the file before going into
do_sync_write.  do_sync_write is done with the struct file before it
goes into the aio_write() loop.  Result: you probably save at least an
L1 cache miss, unless the aio_write loop is so frugal with L1 cache
that it doesn't manage to evict the struct file.

Patch to follow.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-08 23:08 Michael K. Edwards
  2007-03-08 23:43 ` Eric Dumazet
@ 2007-03-09  0:43 ` Alan Cox
  1 sibling, 0 replies; 31+ messages in thread
From: Alan Cox @ 2007-03-09  0:43 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Linux Kernel Mailing List

> Surely that's racy when two threads write to the same fd and the first
> vfs_write call blocks or is preempted.  Or does fget_light take some
> per-file lock that I'm not seeing?

I think you are making assumptions about file position behaviour that are
simply not guaranteed in the standards document.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-08 23:57   ` Michael K. Edwards
@ 2007-03-09  0:15     ` Eric Dumazet
  2007-03-09  0:45       ` Michael K. Edwards
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-09  0:15 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Linux Kernel Mailing List

Michael K. Edwards wrote:
> On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Nothing in the manuals says that write() on the same fd should be
>> non-racy: in particular, the file position might be undefined. There is a
>> reason pwrite() exists.
>>
>> The kernel doesn't have to enforce thread safety, as the standard is quite
>> clear.
> 
> I know the standard _allows_ us to crash and burn (well, corrupt
> f_pos) when two threads race on an fd, but why would we want to?
> Wouldn't it be better to do something at least slightly sane, like add
> atomically to f_pos the expected number of bytes written,
> then do the write, then fix it up (again atomically) if vfs_write
> returns an unexpected pos?

Absolutely not. We don't want to slow down the kernel 'just in case a fool
might want to do crazy things'.


> 
>> Only the O_APPEND case is specially handled (and NFS might fail to
>> handle this case correctly).
> 
> Is it?  How?
mm/filemap.c

generic_write_checks()

if (file->f_flags & O_APPEND)
	*pos = i_size_read(inode);

This is done while the inode is locked.

O_APPEND basically says: just ignore f_pos and always use the end of file.
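The effect is easy to observe from user space; here is a hedged sketch (the helper and the file name in the usage are invented for the demo):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* O_APPEND makes the kernel position every write at end-of-file
 * atomically (under the inode lock, per the generic_write_checks()
 * snippet above), so concurrent appenders cannot overwrite each other.
 * Their relative ordering is still unspecified. */
static int open_for_append(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}
```

Even an explicit lseek() back to the start cannot make a subsequent write()
clobber earlier data on such an fd.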


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-08 23:43 ` Eric Dumazet
@ 2007-03-08 23:57   ` Michael K. Edwards
  2007-03-09  0:15     ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-08 23:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List

On 3/8/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Nothing in the manuals says that write() on the same fd should be non-racy:
> in particular, the file position might be undefined. There is a reason
> pwrite() exists.
>
> The kernel doesn't have to enforce thread safety, as the standard is quite
> clear.

I know the standard _allows_ us to crash and burn (well, corrupt
f_pos) when two threads race on an fd, but why would we want to?
Wouldn't it be better to do something at least slightly sane, like add
atomically to f_pos the expected number of bytes written,
then do the write, then fix it up (again atomically) if vfs_write
returns an unexpected pos?

> Only the O_APPEND case is specially handled (and NFS might fail to handle
> this case correctly).

Is it?  How?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: sys_write() racy for multi-threaded append?
  2007-03-08 23:08 Michael K. Edwards
@ 2007-03-08 23:43 ` Eric Dumazet
  2007-03-08 23:57   ` Michael K. Edwards
  2007-03-09  0:43 ` Alan Cox
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2007-03-08 23:43 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Linux Kernel Mailing List

Michael K. Edwards wrote:
> from sys_write():
> 
>        file = fget_light(fd, &fput_needed);
>        if (file) {
>                loff_t pos = file_pos_read(file);
>                ret = vfs_write(file, buf, count, &pos);
>                file_pos_write(file, pos);
>                fput_light(file, fput_needed);
>        }
> 
> Surely that's racy when two threads write to the same fd and the first
> vfs_write call blocks or is preempted.  Or does fget_light take some
> per-file lock that I'm not seeing?

Nothing in the manuals says that write() on the same fd should be non-racy:
in particular, the file position might be undefined. There is a reason
pwrite() exists.

The kernel doesn't have to enforce thread safety, as the standard is quite
clear.

Only the O_APPEND case is specially handled (and NFS might fail to handle
this case correctly).


^ permalink raw reply	[flat|nested] 31+ messages in thread

* sys_write() racy for multi-threaded append?
@ 2007-03-08 23:08 Michael K. Edwards
  2007-03-08 23:43 ` Eric Dumazet
  2007-03-09  0:43 ` Alan Cox
  0 siblings, 2 replies; 31+ messages in thread
From: Michael K. Edwards @ 2007-03-08 23:08 UTC (permalink / raw)
  To: Linux Kernel Mailing List

from sys_write():

        file = fget_light(fd, &fput_needed);
        if (file) {
                loff_t pos = file_pos_read(file);
                ret = vfs_write(file, buf, count, &pos);
                file_pos_write(file, pos);
                fput_light(file, fput_needed);
        }

Surely that's racy when two threads write to the same fd and the first
vfs_write call blocks or is preempted.  Or does fget_light take some
per-file lock that I'm not seeing?
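Concretely, the interleaving I have in mind (f_pos starting at 0, both
threads writing 100 bytes) is:

```c
/*
 * thread A: pos = file_pos_read(file);   // reads f_pos == 0
 * thread B: pos = file_pos_read(file);   // also reads 0 (A still writing)
 * thread A: vfs_write(file, ..., &pos);  // writes bytes [0,100), pos = 100
 * thread B: vfs_write(file, ..., &pos);  // overwrites [0,100), pos = 100
 * thread A: file_pos_write(file, pos);   // f_pos = 100
 * thread B: file_pos_write(file, pos);   // f_pos = 100, not 200; data lost
 */
```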

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2007-03-16 16:45 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <7WzUo-1zl-21@gated-at.bofh.it>
     [not found] ` <7WAx2-2pg-21@gated-at.bofh.it>
     [not found]   ` <7WAGF-2Bx-9@gated-at.bofh.it>
     [not found]     ` <7WB07-3g5-33@gated-at.bofh.it>
     [not found]       ` <7WBt7-3SZ-23@gated-at.bofh.it>
2007-03-12  7:53         ` sys_write() racy for multi-threaded append? Bodo Eggert
2007-03-12 16:26           ` Michael K. Edwards
2007-03-12 18:48             ` Bodo Eggert
2007-03-13  0:46               ` Michael K. Edwards
2007-03-13  2:24                 ` Alan Cox
2007-03-13  7:25                   ` Michael K. Edwards
2007-03-13  7:42                     ` David Miller
2007-03-13 16:24                       ` Michael K. Edwards
2007-03-13 17:59                         ` Michael K. Edwards
2007-03-13 19:09                           ` Christoph Hellwig
2007-03-13 23:40                             ` Michael K. Edwards
2007-03-14  0:09                               ` Michael K. Edwards
2007-03-13 13:15                     ` Alan Cox
2007-03-14 20:09                       ` Michael K. Edwards
2007-03-16 16:43                         ` Frank Ch. Eigler
2007-03-16 17:25                         ` Alan Cox
2007-03-13 14:00                   ` David M. Lloyd
2007-03-08 23:08 Michael K. Edwards
2007-03-08 23:43 ` Eric Dumazet
2007-03-08 23:57   ` Michael K. Edwards
2007-03-09  0:15     ` Eric Dumazet
2007-03-09  0:45       ` Michael K. Edwards
2007-03-09  1:34         ` Benjamin LaHaise
2007-03-09 12:19           ` Michael K. Edwards
2007-03-09 13:44             ` Eric Dumazet
2007-03-09 14:10             ` Alan Cox
2007-03-09 14:59             ` Benjamin LaHaise
2007-03-10  6:43               ` Michael K. Edwards
2007-03-09  5:53         ` Eric Dumazet
2007-03-09 11:52           ` Michael K. Edwards
2007-03-09  0:43 ` Alan Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).