Re: sys_write() racy for multi-threaded append?

From: "Michael K. Edwards" <medwards.linux@gmail.com>
To: "David Miller" <davem@davemloft.net>
Cc: alan@lxorguk.ukuu.org.uk, 7eggert@gmx.de, dada1@cosmosbay.com,
	linux-kernel@vger.kernel.org
Subject: Re: sys_write() racy for multi-threaded append?
Date: Tue, 13 Mar 2007 09:24:22 -0700	[thread overview]
Message-ID: <f2b55d220703130924h11786d9fh1d87f90bcd1b3d60@mail.gmail.com> (raw)
In-Reply-To: <20070313.004224.41634994.davem@davemloft.net>

On 3/13/07, David Miller <davem@davemloft.net> wrote:
> You're not even safe over standard output, simply run the program over
> ssh and you suddenly have socket semantics to deal with.

I'm intimately familiar with this one.  Easily worked around by piping
the output through cat or tee.  Not that one should ever write code
for a *nix box that can't cope with stdout being a socket or tty; but
sometimes the quickest way to glue existing code into a pipeline is to
pass /dev/stdout in place of a filename, and there's no real value in
reworking legacy code to handle short writes when you can just drop in
cat.

> In short, if you don't handle short writes, you're writing a program
> for something other than unix.

Right in one.  You're writing a program for a controlled environment,
whether it's a Linux-only API (netlink sockets) or a Linux-only
embedded product.  And there's no need for that controlled environment
to gratuitously reproduce the overly vague semantics of the *nix zoo.

> We're not changing write() to interlock with other parallel callers or
> messing with the f_pos semantics in such cases, that's stupid, please
> cope, kthx.

This is not what I am proposing, and certainly not what I'm
implementing.  Right now f_pos handling is gratuitously thread-unsafe
even in the common, all writes completed normally case.  writev()
opens the largest possible window to f_pos corruption by delaying the
f_pos update until after the vfs_writev() completes.  That's the main
thing I want to see fixed.

_If_ it proves practical to wrap accessor functions around f_pos
accesses, and _if_ accesses to f_pos from outside read_write.c can be
made robust against transient overshoot, then there is no harm in
tightening up f_pos semantics so that successful pipelined writes to
the same file don't collide so easily.  If read-modify-write on f_pos
were atomic, it would be easy to guarantee that it is in a sane state
any time a syscall is not actually outstanding on that struct file.
There's even a potential performance gain from streamlining the
pattern of L1 cache usage in the common case, although I expect it's
negligible.

Making read-modify-write atomic on f_pos is of course not free on all
platforms.  But once you have the accessors, you can decide at kernel
compile time whether or not to interlock them with spinlocks and
memory barriers or some such.  Most platforms probably needn't bother;
there are much bigger time bombs lurking elsewhere in the kernel.  x86
is a bit dicey, because there's no way to atomically update a 64-bit
loff_t in memory, and so in principle you could get a race on carry
across the 32-bit boundary.  (This file instantaneously jumped from
4GB to 8GB in size?  What the hell?  How long would it take you to
spot a preemption between writing out the upper and lower halves of a
64-bit integer?)

In any case, I have certainly gotten the -ENOPATCH message by now.  It
must be nice not to have to care whether things work in the field if
you can point at something stupid that an application programmer did.
I don't always have that luxury.

Cheers,
- Michael