Re: sys_write() racy for multi-threaded append?

From: "Michael K. Edwards" <medwards.linux@gmail.com>
To: 7eggert@gmx.de
Cc: "Eric Dumazet" <dada1@cosmosbay.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>
Subject: Re: sys_write() racy for multi-threaded append?
Date: Mon, 12 Mar 2007 09:26:22 -0700
Message-ID: <f2b55d220703120926k1ab7112fh60227fac670b8b4c@mail.gmail.com>
In-Reply-To: <E1HQfLX-0000fk-68@be1.lrz>

On 3/12/07, Bodo Eggert <7eggert@gmx.de> wrote:
> Michael K. Edwards <medwards.linux@gmail.com> wrote:
> > Actually, I think it would make the kernel (negligibly) faster to bump
> > f_pos before the vfs_write() call.
>
> This is a security risk.
>
> ----------------
> other process:
> unlink(secrest_file)
>
> Thread 1:
> write(fd, large)
> (interrupted)
>
> Thread 2:
> fseek(fd, -n, relative)
> read(fd, buf)
> ----------------

I don't entirely follow this.  Which process is supposed to be secure
here, and what information is supposed to be secret from whom?  But in
any case, are you aware that f_pos is bounded by the filesystem's
maximum file size (s_maxbytes, reached through the inode), not by the
current file size?  Thread 2 can seek/read past the position at which
Thread 1's write began, whether f_pos is bumped before or after the
vfs_write.
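
A minimal userspace sketch of that last point, assuming a POSIX system
with a writable /tmp.  The helper name seek_past_eof() and the scratch
file template are made up for illustration; the only claim is that
lseek() accepts an offset well beyond the current end of file and that
a read() from there returns 0 rather than an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 5 bytes to a scratch file, seek `past`
 * bytes beyond its end, and return the offset lseek() reports.
 * Returns -1 if any step fails or if read() past EOF does not return 0.
 */
static off_t seek_past_eof(off_t past)
{
    char path[] = "/tmp/fpos-demo-XXXXXX";  /* placeholder template */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                      /* file goes away on close */

    if (write(fd, "hello", 5) != 5) {  /* file size is now 5 bytes */
        close(fd);
        return -1;
    }

    /* The seek is not limited by the current file size. */
    off_t pos = lseek(fd, 5 + past, SEEK_SET);

    char c;
    ssize_t n = read(fd, &c, 1);       /* at or beyond EOF: returns 0 */
    close(fd);
    return (n == 0) ? pos : -1;
}
```

Calling seek_past_eof(4096) on an ordinary filesystem should hand back
offset 4101, well past the 5 bytes actually in the file.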

> BTW: The best thing you can do to a program where two threads race for
> writing one fd is to let it crash and burn in the most spectacular way
> allowed without affecting the rest of the system, unless it happens to
> be a pipe and the number of bytes written is less than PIPE_MAX.

That's fine when you're doing integration testing, and should probably be
the default during development.  But if the race is first exposed in
the field, or if the developer is trying to concentrate on a different
problem, "spectacular crash and burn" may do more harm than good.
It's easy enough to refactor the f_pos handling in the kernel so that
it all goes through three or four inline accessor functions, at which
point you can choose your trade-off between speed and idiot-proofness
-- at _kernel_ compile time, or (given future hardware that supports
standardized optionally-atomic-based-on-runtime-flag operations) per
process at run-time.
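
Roughly what I have in mind, as a userspace sketch rather than kernel
code.  Everything here is hypothetical: struct demo_file stands in for
struct file, the demo_fpos_*() names are invented, and the
DEMO_FPOS_LOCKED macro plays the role of a kernel config option.  With
the macro undefined the accessors compile down to plain loads and
stores; with it defined, every access takes a per-file mutex.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef DEMO_FPOS_LOCKED
#include <pthread.h>
#endif

/* Hypothetical stand-in for struct file; only the position matters. */
struct demo_file {
    int64_t pos;
#ifdef DEMO_FPOS_LOCKED
    pthread_mutex_t pos_lock;
#endif
};

/* The speed vs. safety trade-off is chosen here, at compile time. */
#ifdef DEMO_FPOS_LOCKED
#define DEMO_FPOS_LOCK(f)   pthread_mutex_lock(&(f)->pos_lock)
#define DEMO_FPOS_UNLOCK(f) pthread_mutex_unlock(&(f)->pos_lock)
#else
#define DEMO_FPOS_LOCK(f)   ((void)0)
#define DEMO_FPOS_UNLOCK(f) ((void)0)
#endif

static inline void demo_fpos_init(struct demo_file *f)
{
    f->pos = 0;
#ifdef DEMO_FPOS_LOCKED
    pthread_mutex_init(&f->pos_lock, NULL);
#endif
}

static inline int64_t demo_fpos_read(struct demo_file *f)
{
    DEMO_FPOS_LOCK(f);
    int64_t pos = f->pos;
    DEMO_FPOS_UNLOCK(f);
    return pos;
}

static inline void demo_fpos_set(struct demo_file *f, int64_t pos)
{
    DEMO_FPOS_LOCK(f);
    f->pos = pos;
    DEMO_FPOS_UNLOCK(f);
}

/*
 * Claim `len` bytes starting at the current position and advance the
 * position before any data is written.  Returns the old position,
 * which is where the caller's write should land.
 */
static inline int64_t demo_fpos_reserve(struct demo_file *f, int64_t len)
{
    DEMO_FPOS_LOCK(f);
    int64_t start = f->pos;
    f->pos = start + len;
    DEMO_FPOS_UNLOCK(f);
    return start;
}
```

In the locked build, two threads calling demo_fpos_reserve() on the
same file get disjoint ranges; in the unlocked build they race, which
is exactly the trade-off being selected.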

Frankly, I think that unless application programmers poke at some sort
of magic "I promise to handle short writes correctly" bit, write()
should always return either the full number of bytes requested or an
error code.  If they do poke at that bit, the (development) kernel
should deliberately split writes a few percent of the time, just to
exercise the short-write code paths.  And in order to find out where
that magic bit is, they should have to read the kernel code or ask on
LKML (and get the "standard lecture").
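
For reference, "handling short writes correctly" on the application
side looks something like the loop below.  This is a sketch under
stated assumptions: write_all() and short_write() are names invented
here, and short_write() is a userspace shim that caps every call at 3
bytes to stand in for a kernel that deliberately splits writes.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

typedef ssize_t (*write_fn)(int fd, const void *buf, size_t len);

/*
 * Hypothetical test shim: behaves like write() but never accepts more
 * than 3 bytes per call, so callers always see short writes.
 */
static ssize_t short_write(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len > 3 ? 3 : len);
}

/*
 * Keep calling do_write() until all `len` bytes are out.
 * Returns 0 on success, -1 with errno set on a real error.
 */
static int write_all(write_fn do_write, int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = do_write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)   /* interrupted before any data: retry */
                continue;
            return -1;
        }
        p += n;                   /* skip what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

Passing write itself as do_write gives the production behaviour;
passing short_write exercises the retry path on every call.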

Really very IEEE754-like, actually.  (Harp, harp.)

Cheers,
- Michael