All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: Update of file offset on write() etc. is non-atomic with I/O
       [not found] <a8df285f-de7f-4a3a-9a19-e0ad07ab3a5c@blur>
@ 2014-02-20 18:15   ` Zuckerman, Boris
  0 siblings, 0 replies; 43+ messages in thread
From: Zuckerman, Boris @ 2014-02-20 18:15 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk (man-pages)
  Cc: lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, Al Viro,
	J. Bruce Fields, Yongzhi Pan

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 6053 bytes --]

Hi,

You probably already considered that - sorry, if so…

Instead of the mutex Windows use ExecutiveResource with shared and exclusive semantics. Readers serialize by taking the resource shared and writers take it exclusive. I have that implemented for Linux. Please, let me know if there is any interest!


Sent from my Verizon Wireless Droid


-----Original message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
To: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: lkml <linux-kernel@vger.kernel.org>, Miklos Szeredi <miklos@szeredi.hu>, Theodore T&apos;so <tytso@mit.edu>, Christoph Hellwig <hch@lst.de>, Chris Mason <clm@fb.com>, Dave Chinner <david@fromorbit.com>, Linux-Fsdevel <linux-fsdevel@vger.kernel.org>, Al Viro <viro@zeniv.linux.org.uk>, "J. Bruce Fields" <bfields@citi.umich.edu>, Yongzhi Pan <panyongzhi@gmail.com>
Sent: Thu, Feb 20, 2014 17:15:07 GMT+00:00
Subject: Re: Update of file offset on write() etc. is non-atomic with I/O

Yes, I do think we violate POSIX here because of how we handle f_pos
(the earlier thread from 2006 you point to talks about the "thread
safe" part, the point here about the actual wording of "atomic" is a
separate issue).

Long long ago we used to just pass in the pointer to f_pos directly,
and then the low-level write would update it all under the inode
semaphore, and all was good.

And then we ended up having tons of problems with non-regular files
and drivers accessing f_pos and having nasty races with it because
they did *not* have any locking (and very fundamentally didn't want
any), and we broke the serialization of f_pos. We still do the *IO*
atomically, but yes, the f_pos access itself is outside the lock.

Ho humm.. Al, any ideas of how to fix this?

                     Linus



On Mon, Feb 17, 2014 at 7:41 AM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hello all,
>
> A note from Yongzhi Pan about some of my own code led me to dig deeper
> and discover behavior that is surprising and also seems to be a
> fairly clear violation of POSIX requirements.
>
> It appears that write() (and, presumably read() and other similar
> system calls) are not atomic with respect to performing I/O and
> updating the file offset behavior.
>
> The problem can be demonstrated using the program below.
> That program takes three arguments:
>
> $ ./multi_writer num-children num-blocks block-size > somefile
>
> It creates 'num-children' children, each of which writes 'num-blocks'
> blocks of 'block-size' bytes to standard output; for my experiments,
> stdout is redirected to a file. After all children have finished,
> the parent inspects the size of the file written on stdout, calculates
> the expected size of the file, and displays these two values, and
> their difference on stderr.
>
> Some observations:
>
> * All children inherit the stdout file descriptor from the parent;
>   thus the FDs refer to the same open file description, and therefore
>   share the file offset.
>
> * When I run this on a multi-CPU BSD systems, I get the expected result:
>
> $ ./multi_writer 10 10000 1000 > g 2> run.log
> $ ls -l g
> -rw-------  1 mkerrisk  users  100000000 Jan 17 07:34 g
>
> * Someone else tested this code for me on a Solaris system, and also got
>   the expected result.
>
> * On Linux, by contrast, we see behavior such as the following:
>
> $ ./multi_writer 10 10000 1000 > g
> Expected file size:  100000000
> Actual file size:     16323000
> Difference:           83677000
> $ ls -l g
> -rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g
>
> Summary of the above output: some children are overwriting the output
> of other children because output is not atomic with respect to updates
> to the file offset.
>
> For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:
>
> [[
> 2.9.7 Thread Interactions with Regular File Operations
>
> All of the following functions shall be atomic with respect to each other
> in the effects specified in POSIX.1-2008 when they operate on regular
> files or symbolic links:
>
>
> chmod()
> ...
> pread()
> read()
> ...
> readv()
> pwrite()
> ...
> write()
> writev()
>
>
> If two threads each call one of these functions, each call shall either
> see all of the specified effects of the other call, or none of them.
> ]]
>
> (POSIX.1-2001 has similar text.)
>
> This text is in one of the Threads sections, but it applies equally
> to threads in different processes as to threads in the same process.
>
> I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a
> number of other recent kernels, all with similar results, which suggests
> the result is in the VFS layer. (Can it really be so simple as no locking
> around pieces such as
>
>                 loff_t pos = file_pos_read(f.file);
>                 ret = vfs_write(f.file, buf, count, &pos);
>                 if (ret >= 0)
>                         file_pos_write(f.file, pos);
>
> in fs/read_write.c?)
>
> I discovered this behavior after Yongzhi Pan reported some unexpected
> behavior in some of my code that forked to create a parent and
> child that wrote to the same file. In some cases, expected output
> was not appearing. In other words, after a fork(), and in the absence
> of any other synchronization technique, a parent and a child cannot
> safely write to the same file descriptor without risking overwriting
> each other's output. But POSIX requires this, and other systems seem
> to guarantee it.
>
> Am I correct to think there's a kernel problem here?
>
> Thanks,
>
> Michael
>
> ===
>
> /* multi_writer.c
> */
>
> #include <sys/wait.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <errno.h>
>
> typedef enum { FALSE, TRUE } Boolean;
>
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>             &nbs
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
@ 2014-02-20 18:15   ` Zuckerman, Boris
  0 siblings, 0 replies; 43+ messages in thread
From: Zuckerman, Boris @ 2014-02-20 18:15 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk (man-pages)
  Cc: lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, Al Viro,
	J. Bruce Fields, Yongzhi Pan

Hi,

You probably already considered that - sorry, if so…

Instead of the mutex Windows use ExecutiveResource with shared and exclusive semantics. Readers serialize by taking the resource shared and writers take it exclusive. I have that implemented for Linux. Please, let me know if there is any interest!


Sent from my Verizon Wireless Droid


-----Original message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
To: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: lkml <linux-kernel@vger.kernel.org>, Miklos Szeredi <miklos@szeredi.hu>, Theodore T&apos;so <tytso@mit.edu>, Christoph Hellwig <hch@lst.de>, Chris Mason <clm@fb.com>, Dave Chinner <david@fromorbit.com>, Linux-Fsdevel <linux-fsdevel@vger.kernel.org>, Al Viro <viro@zeniv.linux.org.uk>, "J. Bruce Fields" <bfields@citi.umich.edu>, Yongzhi Pan <panyongzhi@gmail.com>
Sent: Thu, Feb 20, 2014 17:15:07 GMT+00:00
Subject: Re: Update of file offset on write() etc. is non-atomic with I/O

Yes, I do think we violate POSIX here because of how we handle f_pos
(the earlier thread from 2006 you point to talks about the "thread
safe" part, the point here about the actual wording of "atomic" is a
separate issue).

Long long ago we used to just pass in the pointer to f_pos directly,
and then the low-level write would update it all under the inode
semaphore, and all was good.

And then we ended up having tons of problems with non-regular files
and drivers accessing f_pos and having nasty races with it because
they did *not* have any locking (and very fundamentally didn't want
any), and we broke the serialization of f_pos. We still do the *IO*
atomically, but yes, the f_pos access itself is outside the lock.

Ho humm.. Al, any ideas of how to fix this?

                     Linus



On Mon, Feb 17, 2014 at 7:41 AM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hello all,
>
> A note from Yongzhi Pan about some of my own code led me to dig deeper
> and discover behavior that is surprising and also seems to be a
> fairly clear violation of POSIX requirements.
>
> It appears that write() (and, presumably read() and other similar
> system calls) are not atomic with respect to performing I/O and
> updating the file offset behavior.
>
> The problem can be demonstrated using the program below.
> That program takes three arguments:
>
> $ ./multi_writer num-children num-blocks block-size > somefile
>
> It creates 'num-children' children, each of which writes 'num-blocks'
> blocks of 'block-size' bytes to standard output; for my experiments,
> stdout is redirected to a file. After all children have finished,
> the parent inspects the size of the file written on stdout, calculates
> the expected size of the file, and displays these two values, and
> their difference on stderr.
>
> Some observations:
>
> * All children inherit the stdout file descriptor from the parent;
>   thus the FDs refer to the same open file description, and therefore
>   share the file offset.
>
> * When I run this on a multi-CPU BSD systems, I get the expected result:
>
> $ ./multi_writer 10 10000 1000 > g 2> run.log
> $ ls -l g
> -rw-------  1 mkerrisk  users  100000000 Jan 17 07:34 g
>
> * Someone else tested this code for me on a Solaris system, and also got
>   the expected result.
>
> * On Linux, by contrast, we see behavior such as the following:
>
> $ ./multi_writer 10 10000 1000 > g
> Expected file size:  100000000
> Actual file size:     16323000
> Difference:           83677000
> $ ls -l g
> -rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g
>
> Summary of the above output: some children are overwriting the output
> of other children because output is not atomic with respect to updates
> to the file offset.
>
> For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:
>
> [[
> 2.9.7 Thread Interactions with Regular File Operations
>
> All of the following functions shall be atomic with respect to each other
> in the effects specified in POSIX.1-2008 when they operate on regular
> files or symbolic links:
>
>
> chmod()
> ...
> pread()
> read()
> ...
> readv()
> pwrite()
> ...
> write()
> writev()
>
>
> If two threads each call one of these functions, each call shall either
> see all of the specified effects of the other call, or none of them.
> ]]
>
> (POSIX.1-2001 has similar text.)
>
> This text is in one of the Threads sections, but it applies equally
> to threads in different processes as to threads in the same process.
>
> I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a
> number of other recent kernels, all with similar results, which suggests
> the result is in the VFS layer. (Can it really be so simple as no locking
> around pieces such as
>
>                 loff_t pos = file_pos_read(f.file);
>                 ret = vfs_write(f.file, buf, count, &pos);
>                 if (ret >= 0)
>                         file_pos_write(f.file, pos);
>
> in fs/read_write.c?)
>
> I discovered this behavior after Yongzhi Pan reported some unexpected
> behavior in some of my code that forked to create a parent and
> child that wrote to the same file. In some cases, expected output
> was not appearing. In other words, after a fork(), and in the absence
> of any other synchronization technique, a parent and a child cannot
> safely write to the same file descriptor without risking overwriting
> each other's output. But POSIX requires this, and other systems seem
> to guarantee it.
>
> Am I correct to think there's a kernel problem here?
>
> Thanks,
>
> Michael
>
> ===
>
> /* multi_writer.c
> */
>
> #include <sys/wait.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <errno.h>
>
> typedef enum { FALSE, TRUE } Boolean;
>
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>             &nbs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-20 18:15   ` Zuckerman, Boris
  (?)
@ 2014-02-20 18:29   ` Al Viro
  2014-02-21  6:01     ` Michael Kerrisk (man-pages)
  -1 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-02-20 18:29 UTC (permalink / raw)
  To: Zuckerman, Boris
  Cc: Linus Torvalds, Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Thu, Feb 20, 2014 at 06:15:15PM +0000, Zuckerman, Boris wrote:
> Hi,
> 
> You probably already considered that - sorry, if so…
> 
> Instead of the mutex Windows use ExecutiveResource with shared and exclusive semantics. Readers serialize by taking the resource shared and writers take it exclusive. I have that implemented for Linux. Please, let me know if there is any interest!

See include/linux/rwsem.h...

Anyway, the really interesting question here is what does POSIX promise
wrt lseek() vs. write().  What warranties are given there?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-20 18:29   ` Al Viro
@ 2014-02-21  6:01     ` Michael Kerrisk (man-pages)
  2014-02-23  1:18       ` Kevin Easton
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-21  6:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Zuckerman, Boris, Linus Torvalds, lkml, Miklos Szeredi,
	Theodore T'so, Christoph Hellwig, Chris Mason, Dave Chinner,
	Linux-Fsdevel, J. Bruce Fields, Yongzhi Pan

On Thu, Feb 20, 2014 at 7:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Feb 20, 2014 at 06:15:15PM +0000, Zuckerman, Boris wrote:
>> Hi,
>>
>> You probably already considered that - sorry, if so...
>>
>> Instead of the mutex Windows use ExecutiveResource with shared and exclusive semantics. Readers serialize by taking the resource shared and writers take it exclusive. I have that implemented for Linux. Please, let me know if there is any interest!
>
> See include/linux/rwsem.h...
>
> Anyway, the really interesting question here is what does POSIX promise
> wrt lseek() vs. write().  What warranties are given there?

I suppose you are wondering about cases such as:

Process A                     Process B
write():                      lseek()
    perform I/O
                              update f_pos
    update f_pos()

In my reading of POSIX, lseeek() and write() should be atomic w.r.t.
each other, and the above should not be allowed.

Here's the fulll list from POSIX.1-2008/SUSv4 Section XSI 2.9.7:

[[
2.9.7 Thread Interactions with Regular File Operations

All of the following functions shall be atomic with respect to each
other in the effects specified in
POSIX.1-2008 when they operate on regular files or symbolic links:

chmod( )
chown( )
close( )
creat( )
dup2( )
fchmod( )
fchmodat( )
fchown( )
fchownat( )
fcntl( )
fstat( )
fstatat( )
ftruncate( )
lchown( )
link( )
linkat( )
lseek( )
lstat( )
open( )
openat( )
pread( )
read( )
readlink( )
readlinkat( )
readv( )
pwrite( )
rename( )
renameat( )
stat( )
symlink( )
symlinkat( )
truncate( )
unlink( )
unlinkat( )
utime( )
utimensat( )
utimes( )
write( )
writev( )

If two threads each call one of these functions, each call shall
either see all of the specified effects
of the other call, or none of them.
]]

I'd bet that there's a bunch of violations to be found, but the
read/write f_pos case is one of the most egregious.

For example, I got curious about stat() versus rename(). If one
stat()s a directory() while a subdirectory is being renamed to a new
name within that directory, does the link count of the parent
directory ever change--that is, could stat() ever see a changed link
count in the middle of the rename()? My experiments suggest that it
can. I suppose it would have to be a very unusual application that
would be troubled by that, but it does appear to be a violation of
2.9.7.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-21  6:01     ` Michael Kerrisk (man-pages)
@ 2014-02-23  1:18       ` Kevin Easton
  2014-02-23  7:38         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Easton @ 2014-02-23  1:18 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Al Viro, Zuckerman, Boris, Linus Torvalds, lkml, Miklos Szeredi,
	Theodore T'so, Christoph Hellwig, Chris Mason, DaveC

On Fri, Feb 21, 2014 at 07:01:31AM +0100, Michael Kerrisk (man-pages) wrote:
> Here's the fulll list from POSIX.1-2008/SUSv4 Section XSI 2.9.7:
> 
> [[
> 2.9.7 Thread Interactions with Regular File Operations
> 
> All of the following functions shall be atomic with respect to each
> other in the effects specified in
> POSIX.1-2008 when they operate on regular files or symbolic links:
> 
> chmod( )
> chown( )
> close( )
> creat( )
> dup2( )
> fchmod( )
> fchmodat( )
> fchown( )
> fchownat( )
> fcntl( )
> fstat( )
> fstatat( )
> ftruncate( )
> lchown( )
> link( )
> linkat( )
> lseek( )
> lstat( )
> open( )
> openat( )
> pread( )
> read( )
> readlink( )
> readlinkat( )
> readv( )
> pwrite( )
> rename( )
> renameat( )
> stat( )
> symlink( )
> symlinkat( )
> truncate( )
> unlink( )
> unlinkat( )
> utime( )
> utimensat( )
> utimes( )
> write( )
> writev( )
> 
> If two threads each call one of these functions, each call shall
> either see all of the specified effects
> of the other call, or none of them.
> ]]
> 
> I'd bet that there's a bunch of violations to be found, but the
> read/write f_pos case is one of the most egregious.
> 
> For example, I got curious about stat() versus rename(). If one
> stat()s a directory() while a subdirectory is being renamed to a new
> name within that directory, does the link count of the parent
> directory ever change--that is, could stat() ever see a changed link
> count in the middle of the rename()? My experiments suggest that it
> can. I suppose it would have to be a very unusual application that
> would be troubled by that, but it does appear to be a violation of
> 2.9.7.

A directory isn't a regular file or symlink, though, so the warranty
would seem to be void in that case.

If you {f}stat() a regular file that is being concurrently renamed() then
the link count shouldn't vary, though.

    - Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-23  1:18       ` Kevin Easton
@ 2014-02-23  7:38         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-23  7:38 UTC (permalink / raw)
  To: Kevin Easton
  Cc: Al Viro, Zuckerman, Boris, Linus Torvalds, lkml, Miklos Szeredi,
	Theodore T'so, Christoph Hellwig, Chris Mason, DaveC

On Sun, Feb 23, 2014 at 2:18 AM, Kevin Easton <kevin@guarana.org> wrote:
> On Fri, Feb 21, 2014 at 07:01:31AM +0100, Michael Kerrisk (man-pages) wrote:
>> Here's the fulll list from POSIX.1-2008/SUSv4 Section XSI 2.9.7:
>>
>> [[
>> 2.9.7 Thread Interactions with Regular File Operations
>>
>> All of the following functions shall be atomic with respect to each
>> other in the effects specified in
>> POSIX.1-2008 when they operate on regular files or symbolic links:
>>
>> chmod( )
>> chown( )
>> close( )
>> creat( )
>> dup2( )
>> fchmod( )
>> fchmodat( )
>> fchown( )
>> fchownat( )
>> fcntl( )
>> fstat( )
>> fstatat( )
>> ftruncate( )
>> lchown( )
>> link( )
>> linkat( )
>> lseek( )
>> lstat( )
>> open( )
>> openat( )
>> pread( )
>> read( )
>> readlink( )
>> readlinkat( )
>> readv( )
>> pwrite( )
>> rename( )
>> renameat( )
>> stat( )
>> symlink( )
>> symlinkat( )
>> truncate( )
>> unlink( )
>> unlinkat( )
>> utime( )
>> utimensat( )
>> utimes( )
>> write( )
>> writev( )
>>
>> If two threads each call one of these functions, each call shall
>> either see all of the specified effects
>> of the other call, or none of them.
>> ]]
>>
>> I'd bet that there's a bunch of violations to be found, but the
>> read/write f_pos case is one of the most egregious.
>>
>> For example, I got curious about stat() versus rename(). If one
>> stat()s a directory() while a subdirectory is being renamed to a new
>> name within that directory, does the link count of the parent
>> directory ever change--that is, could stat() ever see a changed link
>> count in the middle of the rename()? My experiments suggest that it
>> can. I suppose it would have to be a very unusual application that
>> would be troubled by that, but it does appear to be a violation of
>> 2.9.7.
>
> A directory isn't a regular file or symlink, though, so the warranty
> would seem to be void in that case.

Oops -- yes, of course.

> If you {f}stat() a regular file that is being concurrently renamed() then
> the link count shouldn't vary, though.

Yes. (I haven't tested that though.)

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:03 George Spelvin
  2014-03-03 21:26 ` Al Viro
@ 2014-05-04  7:04 ` Michael Kerrisk
  1 sibling, 0 replies; 43+ messages in thread
From: Michael Kerrisk @ 2014-05-04  7:04 UTC (permalink / raw)
  To: George Spelvin
  Cc: Linus Torvalds, Linux-Fsdevel, Linux Kernel, Al Viro,
	Michael Kerrisk-manpages

Ouch! I've just seen that trimming the CC on this reply took me out of
a large part of the subsequent conversation. PLEASE don't trim CCs,
and especially not the address of the OP!!

On Mon, Mar 3, 2014 at 10:03 PM, George Spelvin <linux@horizon.com> wrote:
>>  struct fd {
>>       struct file *file;
>> -     int need_put;
>> +     unsigned need_put:1, need_pos_unlock:1;
>>  };
>
> Since we're rounding up to 2*sizeof(struct file *) anyway, is this a case
> where wasting space on a couple of char (or bool) flags would generate
> better code than a bitfield?
>
> In particular, the code to set need_pos_unlock (which will be executed
> each read/write for most files) is messy in the bitfield case.
> A byte store is much cleaner.
>
> (If you want to use bits, why not use the two lsbits of the file pointer
> for the purpose?  That would save a lot of space.)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-05  0:04                           ` Al Viro
@ 2014-03-10 15:55                             ` Al Viro
  0 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2014-03-10 15:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List,
	David Miller, Marcel Holtmann

[-- Attachment #1: Type: text/plain, Size: 3663 bytes --]

On Wed, Mar 05, 2014 at 12:04:11AM +0000, Al Viro wrote:
> There's also a pile of crap around sockfd_lookup/sockfd_put, related
> to that.   Moreover, there's net/compat.c, which probably ought to
> have the compat syscalls themselves moved to net/socket.c (under
> ifdef CONFIG_COMPAT) and switched to sockfd_lookup_light().
> There's l2tp_tunnel_sock_lookup(), which is simply broken - it assumes
> that if tunnel->fd still resolves to a socket, that socket must
> be l2tp one.  Trivial to drive into BUG_ON(), in queue_work() callback,
> no less...  There's bluetooth, assuming that pretty much the same
> (that if it got a file descriptor that resolves to a socket, it must
> be a bluetooth one).  BTW, I wonder what will happen if one gives
> iscsi_sw_tcp_conn_bind() descriptor of a socket of sufficiently
> weird sort...
> 
> Then there's staging/usbip with its sockfd_to_socket(), which is more or
> less parallel to sockfd_lookup().  And open-coded analogs in nbd and
> ncpfs...

OK, I've gone through most of that; bluetooth is, indeed, oopsable (as simple
as e.g.
        int sv[2];
        int fd = socket(PF_BLUETOOTH, SOCK_RAW, BTPROTO_CMTP);
        struct cmtp_connadd_req r = {};
        socketpair(PF_LOCAL, SOCK_STREAM, 0, sv);
        r.sock = sv[0];
        ioctl(fd, CMTPCONNADD, (unsigned long)&r);
and similar with BNEP) and that one is easy to fix.  l2tp I'd rather leave
for net folks to deal with - the problem there is that we stash sock *and*
descriptor number into struct l2tp_tunnel in l2tp_tunnel_create() and expect
l2tp_tunnel_sock_lookup() to find that descriptor (tunnel->fd) resolving
to nothing (if it got already closed) or to the same socket.  Unfortunately,
the caller (l2tp_tunnel_del_work()) expects to find l2tp socket in the latter
case, so having it replaced with unrelated socket will do nasty things
to that caller.  It looks rather silly, actually - the actual fuckup happens
when l2tp_tunnel_del_work() passes what has come from socket->sk to
l2tp_tunnel_sock_put(), which does
        struct l2tp_tunnel *tunnel = l2tp_sock_to_tunnel(sk);
to find the tunnel its caller already had.  Looks too convoluted for its
own good, and my first inclination would be to collapse l2tp_tunnel_sock_*
into the (only) caller, but I'm not sure if I'm not missing some subtle
race prevention in those back-and-forth lookups.  In any case, it can
lead to l2tp_sock_to_tunnel() called on a sock that has nothing to do with
l2tp, so we do have a bug there.

I've attached bluetooth fixes; this stuff is obviously better off in one of
the net trees.  Not sure if it's worth Cc:stable - up to Marcel and Davem.
These bugs are oopsable, but you need CAP_NET_ADMIN to step into those...

I think that what's in vfs.git#for-linus right now is OK to pull; it survives
all the beating I could think of.  Linus, could you pull from the usual place?

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (3):
      ocfs2 syncs the wrong range...
      sockfd_lookup_light(): switch to fdget^W^Waway from fget_light
      get rid of fget_light()

Linus Torvalds (1):
      vfs: atomic f_pos accesses as per POSIX

Diffstat:
 fs/file.c            |   56 +++++++++++++++++++++++++++++++++++++++++++-------------
 fs/file_table.c      |    1 +
 fs/namei.c           |    2 +-
 fs/ocfs2/file.c      |    8 ++++----
 fs/open.c            |    4 ++++
 fs/read_write.c      |   40 ++++++++++++++++++++++++++--------------
 include/linux/file.h |   27 +++++++++++++++------------
 include/linux/fs.h   |    8 ++++++--
 net/socket.c         |   13 +++++++------
 9 files changed, 107 insertions(+), 52 deletions(-)


[-- Attachment #2: 0001-bluetooth-hidp_connection_add-unsafe-use-of-l2cap_pi.patch --]
[-- Type: text/plain, Size: 1026 bytes --]

>From f49d9ab3220ece0b635b18212bfc44444e9b5f41 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Sun, 9 Mar 2014 13:11:59 -0400
Subject: [PATCH 1/3] bluetooth: hidp_connection_add() unsafe use of
 l2cap_pi()

it's OK after we'd verified the sockets, but not before that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/bluetooth/hidp/core.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index d9fb934..6134618 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -1332,13 +1332,14 @@ int hidp_connection_add(struct hidp_connadd_req *req,
 {
 	struct hidp_session *session;
 	struct l2cap_conn *conn;
-	struct l2cap_chan *chan = l2cap_pi(ctrl_sock->sk)->chan;
+	struct l2cap_chan *chan;
 	int ret;
 
 	ret = hidp_verify_sockets(ctrl_sock, intr_sock);
 	if (ret)
 		return ret;
 
+	chan = l2cap_pi(ctrl_sock->sk)->chan;
 	conn = NULL;
 	l2cap_chan_lock(chan);
 	if (chan->conn) {
-- 
1.7.10.4


[-- Attachment #3: 0002-cmtp-cmtp_add_connection-should-verify-that-it-s-dea.patch --]
[-- Type: text/plain, Size: 1007 bytes --]

>From 790f94a74f8214baab44eb346a346640e6335319 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Mon, 10 Mar 2014 10:50:10 -0400
Subject: [PATCH 2/3] cmtp: cmtp_add_connection() should verify that it's
 dealing with l2cap socket

... rather than relying on ciptool(8) never passing it anything else.  Give
it e.g. an AF_UNIX connected socket (from socketpair(2)) and it'll oops,
trying to evaluate &l2cap_pi(sock->sk)->chan->dst...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/bluetooth/cmtp/core.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 67fe5e8..fd57db8 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -333,6 +333,8 @@ int cmtp_add_connection(struct cmtp_connadd_req *req, struct socket *sock)
 	int i, err;
 
 	BT_DBG("");
+	if (!l2cap_is_socket(sock))
+		return -EBADFD;
 
 	session = kzalloc(sizeof(struct cmtp_session), GFP_KERNEL);
 	if (!session)
-- 
1.7.10.4


[-- Attachment #4: 0003-bnep-bnep_add_connection-should-verify-that-it-s-dea.patch --]
[-- Type: text/plain, Size: 849 bytes --]

>From 4d86300249b507f27e8c55c0d3bf1fa2653f9b17 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Mon, 10 Mar 2014 11:08:35 -0400
Subject: [PATCH 3/3] bnep: bnep_add_connection() should verify that it's
 dealing with l2cap socket

same story as cmtp

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/bluetooth/bnep/core.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index a841d3e..c7a19a1 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -533,6 +533,9 @@ int bnep_add_connection(struct bnep_connadd_req *req, struct socket *sock)
 
 	BT_DBG("");
 
+	if (!l2cap_is_socket(sock))
+		return -EBADFD;
+
 	baswap((void *) dst, &l2cap_pi(sock->sk)->chan->dst);
 	baswap((void *) src, &l2cap_pi(sock->sk)->chan->src);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-06 15:03     ` Michael Kerrisk (man-pages)
@ 2014-03-07  3:38       ` Yongzhi Pan
  0 siblings, 0 replies; 43+ messages in thread
From: Yongzhi Pan @ 2014-03-07  3:38 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Linus Torvalds, lkml, Miklos Szeredi, Theodore T'so,
	Christoph Hellwig, Chris Mason, Dave Chinner, Linux-Fsdevel,
	Al Viro, J. Bruce Fields

2014-03-06 23:03 GMT+08:00 Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>:
>
> The original code (where Yongzhi Pan reported an issue) and
> the multi_writer.c test code where both mine actually. Anyway,
> I applied the patch to 3.14-rc5, and I (not withstanding the other
> points raised by Al about the patch) I confirm that this patch makes
> the  problem that I'm seeing go away.
>
> Thanks for looking at this, Linus.
>
> Cheers,
>
> Michael


I tested the original patch on 3.13.5, the multi_writer.c test code
and the code in your book both run flawlessly.


Best Regards,
Yongzhi Pan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 17:36   ` Linus Torvalds
  2014-03-03 21:45     ` Al Viro
@ 2014-03-06 15:03     ` Michael Kerrisk (man-pages)
  2014-03-07  3:38       ` Yongzhi Pan
  1 sibling, 1 reply; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-03-06 15:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mtk.manpages, lkml, Miklos Szeredi, Theodore T'so,
	Christoph Hellwig, Chris Mason, Dave Chinner, Linux-Fsdevel,
	Al Viro, J. Bruce Fields, Yongzhi Pan

On 03/03/2014 06:36 PM, Linus Torvalds wrote:
> Ok, sorry for the long delay, I was distracted (and hoping that Al
> would come up with a patch).
> 
> Anyway, attached is the patch I think we should do for this issue. It
> is fairly simple:
> 
>  - it adds a "f_pos_mutex" to the "struct file".
> 
>  - it adds a new FMODE_ATOMIC_POS flag to the file mode flags to mark
> things that need atomic f_pos updates
> 
>  - it makes the "struct fd" flags be two flags rather than one: the
> second flag is for "unlock f_pos_mutex when done"
> 
>  - it introduces "fd[get,put]_pos()" which gets the f_pos_mutex when required
> 
>  - it makes read/write/lseek use that.
> 
> It's pretty damn straightforward, I think, and is minimally serializing.
> 
> Al, comments? Yongzhi Pan, this is pretty much untested, but it's
> pretty simple and it does fix your test-case.

The original code (where Yongzhi Pan reported an issue) and
the multi_writer.c test code where both mine actually. Anyway,
I applied the patch to 3.14-rc5, and I (not withstanding the other
points raised by Al about the patch) I confirm that this patch makes 
the  problem that I'm seeing go away.

Thanks for looking at this, Linus.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-04 21:17                         ` Linus Torvalds
@ 2014-03-05  0:04                           ` Al Viro
  2014-03-10 15:55                             ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-05  0:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 04, 2014 at 01:17:50PM -0800, Linus Torvalds wrote:
> On Tue, Mar 4, 2014 at 12:00 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK, with the attached set (the first one is essentially unchanged from
> > your first one), it seems to work and produce better code on all targets
> > I've tried.  Comments?
> 
> I'm certainly ok with it. You seem to have left the fput_light()
> function around, though, despite removing fget_[raw_]light(). That
> seems a bit silly, since there is no valid use any more apart from
> net/socket.c that now doesn't balance things properly.

There's also a pile of crap around sockfd_lookup/sockfd_put, related
to that.   Moreover, there's net/compat.c, which probably ought to
have the compat syscalls themselves moved to net/socket.c (under
ifdef CONFIG_COMPAT) and switched to sockfd_lookup_light().
There's l2tp_tunnel_sock_lookup(), which is simply broken - it assumes
that if tunnel->fd still resolves to a socket, that socket must
be l2tp one.  Trivial to drive into BUG_ON(), in queue_work() callback,
no less...  There's bluetooth, assuming that pretty much the same
(that if it got a file descriptor that resolves to a socket, it must
be a bluetooth one).  BTW, I wonder what will happen if one gives
iscsi_sw_tcp_conn_bind() descriptor of a socket of sufficiently
weird sort...

Then there's staging/usbip with its sockfd_to_socket(), which is more or
less parallel to sockfd_lookup().  And open-coded analogs in nbd and
ncpfs...

> > I've also pushed those (on top of old ocfs2 fix) into vfs.git#for-linus,
> > if you prefer to read it that way.  Should propagate in a few...
> 
> Should I pull?
> 
> I also get the feeling that the first patch should likely be marked
> for stable. Hmm?

It should; I'll mark it such when I send a pull request.  I really want
to sort the situation with sockfd_lookup() and friends out, though - at
least to the point where I would understand how painful the fixes
will be.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-04 20:00                       ` Al Viro
@ 2014-03-04 21:17                         ` Linus Torvalds
  2014-03-05  0:04                           ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-04 21:17 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Tue, Mar 4, 2014 at 12:00 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, with the attached set (the first one is essentially unchanged from
> your first one), it seems to work and produce better code on all targets
> I've tried.  Comments?

I'm certainly ok with it. You seem to have left the fput_light()
function around, though, despite removing fget_[raw_]light(). That
seems a bit silly, since there is no valid use any more apart from
net/socket.c that now doesn't balance things properly.

> I've also pushed those (on top of old ocfs2 fix) into vfs.git#for-linus,
> if you prefer to read it that way.  Should propagate in a few...

Should I pull?

I also get the feeling that the first patch should likely be marked
for stable. Hmm?

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:39         ` Al Viro
  2014-03-03 23:54           ` Linus Torvalds
  2014-03-03 23:54           ` Al Viro
@ 2014-03-04 20:11           ` Cedric Blancher
  2 siblings, 0 replies; 43+ messages in thread
From: Cedric Blancher @ 2014-03-04 20:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On 4 March 2014 00:39, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Mar 03, 2014 at 03:23:55PM -0800, Linus Torvalds wrote:
>
>> This just uses a "flags" field, and we currently only have two bits
>> that we use: FDPUT_FPUT and FDPUT_POS_UNLOCK. The first patch knows
>> that "fget_light()" writes 0/1 for that, which is the same as the
>> FDPUT_FPUT bit. I didn't bother to comment on it or clean it up, since
>> the second patch just removes that whole fget_light() mess.
>>
>> Comments?
>
> do_sendfile() is also there and this one is even more unpleasant ;-/
> We probably can ignore that one (until POSIX learns of its existence),
> thouhg...

I've forwarded your request to the Austin Group (who manage the POSIX stuff).

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
Institute Pasteur

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-04  1:05                     ` Al Viro
@ 2014-03-04 20:00                       ` Al Viro
  2014-03-04 21:17                         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-04 20:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 933 bytes --]

On Tue, Mar 04, 2014 at 01:05:17AM +0000, Al Viro wrote:

> It's probably not worth replacing struct fd with typedef to unsigned long -
> too easy to have it confused with a file descriptor itself and pass to
> something that expects e.g. int.  In any case, since we leave fdget()
> inlined, compiler will see the unsigned long it's been initialized with,
> so if it decides that it's cheaper to replace f.file with v & ~3 through
> the whole thing, keep v around and discard local struct fd completely,
> it'll be able to do so just fine...
> 
> I'll play around with cross-builds a bit and see what falls out of all
> that.

OK, with the attached set (the first one is essentially unchanged from
your first one), it seems to work and produce better code on all targets
I've tried.  Comments?

I've also pushed those (on top of old ocfs2 fix) into vfs.git#for-linus,
if you prefer to read it that way.  Should propagate in a few...

[-- Attachment #2: 0001-vfs-atomic-f_pos-accesses-as-per-POSIX.patch --]
[-- Type: text/plain, Size: 8698 bytes --]

>From 61660f7691b301dce7ed2d995d9e95ca4353e51d Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 3 Mar 2014 09:36:58 -0800
Subject: [PATCH 1/3] vfs: atomic f_pos accesses as per POSIX

Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

 "2.9.7 Thread Interactions with Regular File Operations

  All of the following functions shall be atomic with respect to each
  other in the effects specified in POSIX.1-2008 when they operate on
  regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file_table.c      |    1 +
 fs/namei.c           |    2 +-
 fs/open.c            |    4 ++++
 fs/read_write.c      |   54 +++++++++++++++++++++++++++++++++++++-------------
 include/linux/file.h |    6 ++++--
 include/linux/fs.h   |    6 +++++-
 6 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 5fff903..5b24008 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -135,6 +135,7 @@ struct file *get_empty_filp(void)
 	atomic_long_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
 	spin_lock_init(&f->f_lock);
+	mutex_init(&f->f_pos_lock);
 	eventpoll_init_file(f);
 	/* f->f_version: 0 */
 	return f;
diff --git a/fs/namei.c b/fs/namei.c
index 385f781..2f730ef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1884,7 +1884,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 
 		nd->path = f.file->f_path;
 		if (flags & LOOKUP_RCU) {
-			if (f.need_put)
+			if (f.flags & FDPUT_FPUT)
 				*fp = f.file;
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			rcu_read_lock();
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..b9ed8b2 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -705,6 +705,10 @@ static int do_dentry_open(struct file *f,
 		return 0;
 	}
 
+	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
+	if (S_ISREG(inode->i_mode))
+		f->f_mode |= FMODE_ATOMIC_POS;
+
 	f->f_op = fops_get(inode->i_fop);
 	if (unlikely(WARN_ON(!f->f_op))) {
 		error = -ENODEV;
diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746..932bb34 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -264,10 +264,36 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
 }
 EXPORT_SYMBOL(vfs_llseek);
 
+/*
+ * We only lock f_pos if we have threads or if the file might be
+ * shared with another process. In both cases we'll have an elevated
+ * file count (done either by fdget() or by fork()).
+ */
+static inline struct fd fdget_pos(int fd)
+{
+	struct fd f = fdget(fd);
+	struct file *file = f.file;
+
+	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
+		if (file_count(file) > 1) {
+			f.flags |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
+	}
+	return f;
+}
+
+static inline void fdput_pos(struct fd f)
+{
+	if (f.flags & FDPUT_POS_UNLOCK)
+		mutex_unlock(&f.file->f_pos_lock);
+	fdput(f);
+}
+
 SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 {
 	off_t retval;
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	if (!f.file)
 		return -EBADF;
 
@@ -278,7 +304,7 @@ SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 		if (res != (loff_t)retval)
 			retval = -EOVERFLOW;	/* LFS: should only happen on 32 bit platforms */
 	}
-	fdput(f);
+	fdput_pos(f);
 	return retval;
 }
 
@@ -498,7 +524,7 @@ static inline void file_pos_write(struct file *file, loff_t pos)
 
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -506,7 +532,7 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 		ret = vfs_read(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 	return ret;
 }
@@ -514,7 +540,7 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		size_t, count)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -522,7 +548,7 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		ret = vfs_write(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	return ret;
@@ -797,7 +823,7 @@ EXPORT_SYMBOL(vfs_writev);
 SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -805,7 +831,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		ret = vfs_readv(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	if (ret > 0)
@@ -817,7 +843,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -825,7 +851,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		ret = vfs_writev(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	if (ret > 0)
@@ -968,7 +994,7 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 		const struct compat_iovec __user *,vec,
 		compat_ulong_t, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret;
 	loff_t pos;
 
@@ -978,7 +1004,7 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 	ret = compat_readv(f.file, vec, vlen, &pos);
 	if (ret >= 0)
 		f.file->f_pos = pos;
-	fdput(f);
+	fdput_pos(f);
 	return ret;
 }
 
@@ -1035,7 +1061,7 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 		const struct compat_iovec __user *, vec,
 		compat_ulong_t, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret;
 	loff_t pos;
 
@@ -1045,7 +1071,7 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 	ret = compat_writev(f.file, vec, vlen, &pos);
 	if (ret >= 0)
 		f.file->f_pos = pos;
-	fdput(f);
+	fdput_pos(f);
 	return ret;
 }
 
diff --git a/include/linux/file.h b/include/linux/file.h
index cbacf4f..f2517fa 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -28,12 +28,14 @@ static inline void fput_light(struct file *file, int fput_needed)
 
 struct fd {
 	struct file *file;
-	int need_put;
+	unsigned int flags;
 };
+#define FDPUT_FPUT       1
+#define FDPUT_POS_UNLOCK 2
 
 static inline void fdput(struct fd fd)
 {
-	if (fd.need_put)
+	if (fd.flags & FDPUT_FPUT)
 		fput(fd.file);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6082956..ebfde04 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -123,6 +123,9 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is opened with O_PATH; almost nothing can be done with it */
 #define FMODE_PATH		((__force fmode_t)0x4000)
 
+/* File needs atomic accesses to f_pos */
+#define FMODE_ATOMIC_POS	((__force fmode_t)0x8000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
@@ -780,13 +783,14 @@ struct file {
 	const struct file_operations	*f_op;
 
 	/*
-	 * Protects f_ep_links, f_flags, f_pos vs i_size in lseek SEEK_CUR.
+	 * Protects f_ep_links, f_flags.
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
+	struct mutex		f_pos_lock;
 	loff_t			f_pos;
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
-- 
1.7.10.4


[-- Attachment #3: 0002-sockfd_lookup_light-switch-to-fdget-W-Waway-from-fge.patch --]
[-- Type: text/plain, Size: 1022 bytes --]

>From 60990b46c2ac7bbf87ecfd64843f7f19a06822f6 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Mon, 3 Mar 2014 23:48:18 -0500
Subject: [PATCH 2/3] sockfd_lookup_light(): switch to fdget^W^Waway from
 fget_light

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/socket.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 879933a..fd8d86e 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -450,16 +450,17 @@ EXPORT_SYMBOL(sockfd_lookup);
 
 static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
 {
-	struct file *file;
+	struct fd f = fdget(fd);
 	struct socket *sock;
 
 	*err = -EBADF;
-	file = fget_light(fd, fput_needed);
-	if (file) {
-		sock = sock_from_file(file, err);
-		if (sock)
+	if (f.file) {
+		sock = sock_from_file(f.file, err);
+		if (likely(sock)) {
+			*fput_needed = f.flags;
 			return sock;
-		fput_light(file, *fput_needed);
+		}
+		fdput(f);
 	}
 	return NULL;
 }
-- 
1.7.10.4


[-- Attachment #4: 0003-get-rid-of-fget_light.patch --]
[-- Type: text/plain, Size: 5190 bytes --]

>From 6053e1188ff288d95f3fa21965edf1137f478507 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 4 Mar 2014 14:54:22 -0500
Subject: [PATCH 3/3] get rid of fget_light()

instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file.c            |   56 ++++++++++++++++++++++++++++++++++++++------------
 fs/read_write.c      |   16 +--------------
 include/linux/file.h |   21 ++++++++++---------
 include/linux/fs.h   |    2 +-
 4 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index db25c2b..60a45e9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -683,35 +683,65 @@ EXPORT_SYMBOL(fget_raw);
  * The fput_needed flag returned by fget_light should be passed to the
  * corresponding fput_light.
  */
-struct file *__fget_light(unsigned int fd, fmode_t mask, int *fput_needed)
+static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
 
-	*fput_needed = 0;
 	if (atomic_read(&files->count) == 1) {
 		file = __fcheck_files(files, fd);
-		if (file && (file->f_mode & mask))
-			file = NULL;
+		if (!file || unlikely(file->f_mode & mask))
+			return 0;
+		return (unsigned long)file;
 	} else {
 		file = __fget(fd, mask);
-		if (file)
-			*fput_needed = 1;
+		if (!file)
+			return 0;
+		return FDPUT_FPUT | (unsigned long)file;
 	}
-
-	return file;
 }
-struct file *fget_light(unsigned int fd, int *fput_needed)
+unsigned long __fdget(unsigned int fd)
 {
-	return __fget_light(fd, FMODE_PATH, fput_needed);
+	return __fget_light(fd, FMODE_PATH);
 }
-EXPORT_SYMBOL(fget_light);
+EXPORT_SYMBOL(__fdget);
 
-struct file *fget_raw_light(unsigned int fd, int *fput_needed)
+unsigned long __fdget_raw(unsigned int fd)
 {
-	return __fget_light(fd, 0, fput_needed);
+	return __fget_light(fd, 0);
+}
+
+unsigned long __fdget_pos(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct file *file;
+	unsigned long v;
+
+	if (atomic_read(&files->count) == 1) {
+		file = __fcheck_files(files, fd);
+		v = 0;
+	} else {
+		file = __fget(fd, 0);
+		v = FDPUT_FPUT;
+	}
+	if (!file)
+		return 0;
+
+	if (file->f_mode & FMODE_ATOMIC_POS) {
+		if (file_count(file) > 1) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
+	}
+	return v | (unsigned long)file;
 }
 
+/*
+ * We only lock f_pos if we have threads or if the file might be
+ * shared with another process. In both cases we'll have an elevated
+ * file count (done either by fdget() or by fork()).
+ */
+
 void set_close_on_exec(unsigned int fd, int flag)
 {
 	struct files_struct *files = current->files;
diff --git a/fs/read_write.c b/fs/read_write.c
index 932bb34..54e19b9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -264,23 +264,9 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
 }
 EXPORT_SYMBOL(vfs_llseek);
 
-/*
- * We only lock f_pos if we have threads or if the file might be
- * shared with another process. In both cases we'll have an elevated
- * file count (done either by fdget() or by fork()).
- */
 static inline struct fd fdget_pos(int fd)
 {
-	struct fd f = fdget(fd);
-	struct file *file = f.file;
-
-	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		if (file_count(file) > 1) {
-			f.flags |= FDPUT_POS_UNLOCK;
-			mutex_lock(&file->f_pos_lock);
-		}
-	}
-	return f;
+	return __to_fd(__fdget_pos(fd));
 }
 
 static inline void fdput_pos(struct fd f)
diff --git a/include/linux/file.h b/include/linux/file.h
index f2517fa..4d69123 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -40,23 +40,24 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
-extern struct file *fget_light(unsigned int fd, int *fput_needed);
+extern struct file *fget_raw(unsigned int fd);
+extern unsigned long __fdget(unsigned int fd);
+extern unsigned long __fdget_raw(unsigned int fd);
+extern unsigned long __fdget_pos(unsigned int fd);
 
-static inline struct fd fdget(unsigned int fd)
+static inline struct fd __to_fd(unsigned long v)
 {
-	int b;
-	struct file *f = fget_light(fd, &b);
-	return (struct fd){f,b};
+	return (struct fd){(struct file *)(v & ~3),v & 3};
 }
 
-extern struct file *fget_raw(unsigned int fd);
-extern struct file *fget_raw_light(unsigned int fd, int *fput_needed);
+static inline struct fd fdget(unsigned int fd)
+{
+	return __to_fd(__fdget(fd));
+}
 
 static inline struct fd fdget_raw(unsigned int fd)
 {
-	int b;
-	struct file *f = fget_raw_light(fd, &b);
-	return (struct fd){f,b};
+	return __to_fd(__fdget_raw(fd));
 }
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ebfde04..23b2a35 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -812,7 +812,7 @@ struct file {
 #ifdef CONFIG_DEBUG_WRITECOUNT
 	unsigned long f_mnt_write_state;
 #endif
-};
+} __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
 struct file_handle {
 	__u32 handle_bytes;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-04  0:42                   ` Linus Torvalds
@ 2014-03-04  1:05                     ` Al Viro
  2014-03-04 20:00                       ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-04  1:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 04:42:36PM -0800, Linus Torvalds wrote:
> On Mon, Mar 3, 2014 at 4:23 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > I wonder if something like
> >
> > static inline struct fd fdget(int fd)
> > {
> >         unsigned long v = __fdget(fd);
> >         return (struct fd){(struct file *)(v & ~1), v & 1};
> > }
> 
> Make it use "&3" so that we have the two bits we need, and I'll
> happily take that approach, yes.

Hmm...  You want to have that second flag dealt with inside fs/file.c
as well?  OK, but that'll need one more thing - __attribute__((aligned(4)))
on struct file declaration.  We have at least one weird architecture that
is happy to align everything on 16bit boundaries.  Granted, it's half-dead,
but there might be something besides m68k with the same weirdness...

It's probably not worth replacing struct fd with typedef to unsigned long -
too easy to have it confused with a file descriptor itself and pass to
something that expects e.g. int.  In any case, since we leave fdget()
inlined, compiler will see the unsigned long it's been initialized with,
so if it decides that it's cheaper to replace f.file with v & ~3 through
the whole thing, keep v around and discard local struct fd completely,
it'll be able to do so just fine...

I'll play around with cross-builds a bit and see what falls out of all
that.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-04  0:23                 ` Al Viro
@ 2014-03-04  0:42                   ` Linus Torvalds
  2014-03-04  1:05                     ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-04  0:42 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 4:23 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I wonder if something like
>
> static inline struct fd fdget(int fd)
> {
>         unsigned long v = __fdget(fd);
>         return (struct fd){(struct file *)(v & ~1), v & 1};
> }

Make it use "&3" so that we have the two bits we need, and I'll
happily take that approach, yes.

                    Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:59               ` Linus Torvalds
@ 2014-03-04  0:23                 ` Al Viro
  2014-03-04  0:42                   ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-04  0:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 03:59:36PM -0800, Linus Torvalds wrote:

> I doubt it's worth caring about. Even when passing things in memory,
> the end result isn't that much worse than the fget_light() model that
> passes just one of the two fields in memory.

I'm not sure if that's the right approach, TBH.  I wonder if something
like
static inline struct fd fdget(int fd)
{
	unsigned long v = __fdget(fd);
	return (struct fd){(struct file *)(v & ~1), v & 1};
}
would not be a better starting point, with __fdget(fd) being
{
        struct files_struct *files = current->files;
        struct file *file;
        if (atomic_read(&files->count) == 1) {
                file = __fcheck_files(files, fd);
                if (file && (file->f_mode & FMODE_PATH))
                        return 0;
		return (unsigned long)file;
        } else {
                file = __fget(fd, FMODE_PATH);
                return file ? 1 | (unsigned long)file : 0;
        }
}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:52   ` Linus Torvalds
  2014-03-03 22:01     ` Al Viro
  2014-03-03 22:55     ` Linus Torvalds
@ 2014-03-04  0:07     ` George Spelvin
  2 siblings, 0 replies; 43+ messages in thread
From: George Spelvin @ 2014-03-04  0:07 UTC (permalink / raw)
  To: torvalds, viro; +Cc: linux-fsdevel, linux-kernel, linux

Linus Torvalds <torvalds@linux-foundation.org> wrote:
On Mon, Mar 3, 2014 at 1:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> Most of the cases have it kept separately in registers, actually - there's
>> a reason why fdget() and friends are inlined.

> Yes. And bit test and set ops on registers are actually cheaper than
> playing around with bytes.

Ah.  I cut & pasted the code into a separate file and compiled it
out of line.  But I stubbed a lot, so it went to memory.  My bad.

> Oh, and George, your email setup is broken. Gmail thinks your emails
> are spam. I'm not sure why (the usual spf problems do not apply), but
> it's possibly because your name and email looks made up.

Thanks, but damn, I wish they'd give a little more information.

The horizon.com MX record is correct, the mail server's forward and
reverse DNS matches, the SMTP server is a little creaky but I think it's
completely standards-compliant.

I'm using bsd mailx, which is *definitely* creaky (you'll notice a
complete lack of MIME or User-Agent headers), but again, it's kind of
the baseline standard for RFC(2)822 e-mail.

The one infelicity I'm aware of is that I cut & pasted the headers from
an on-line mail archive, but mailx doesn't have an easy way to add
an In-Reply-To: header.  Perhaps a Subject: beginning with "Re:" and
no In-Reply-To: looks odd.

But damnifino.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:42             ` Al Viro
@ 2014-03-03 23:59               ` Linus Torvalds
  2014-03-04  0:23                 ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 23:59 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 3:42 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Mar 03, 2014 at 03:34:43PM -0800, Linus Torvalds wrote:
>>
>> (Side note: I think sparc or something doesn't support it, and may
>> return things in memory. I can't really seem to find it in myself to
>> care)
>
> sparc64 actually does support that.  So does amd64, and, with explicit
> flag, i386.  No other more or less general purpose architecture does.
>
> Not ppc.  Not mips.  Not arm.  I think that some of those are worth
> caring about...

I doubt it's worth caring about. Even when passing things in memory,
the end result isn't that much worse than the fget_light() model that
passes just one of the two fields in memory.

If the ARM/PPC people end up caring, they could add the struct-return
support to gcc. It should be basically just adding the flag for
enabling the calling convention - the core gcc support is obviously
all there, and it's just an issue of whether the calling conventions
allow it or not.

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:39         ` Al Viro
  2014-03-03 23:54           ` Linus Torvalds
@ 2014-03-03 23:54           ` Al Viro
  2014-03-04 20:11           ` Cedric Blancher
  2 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2014-03-03 23:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 11:39:02PM +0000, Al Viro wrote:
> On Mon, Mar 03, 2014 at 03:23:55PM -0800, Linus Torvalds wrote:
> 
> > This just uses a "flags" field, and we currently only have two bits
> > that we use: FDPUT_FPUT and FDPUT_POS_UNLOCK. The first patch knows
> > that "fget_light()" writes 0/1 for that, which is the same as the
> > FDPUT_FPUT bit. I didn't bother to comment on it or clean it up, since
> > the second patch just removes that whole fget_light() mess.
> > 
> > Comments?
> 
> do_sendfile() is also there and this one is even more unpleasant ;-/
> We probably can ignore that one (until POSIX learns of its existence),
> thouhg...

OK, other than read/write/readv/writev/lseek we have only sendfile{,64} and
splice; everything else either deals with struct file that is guaranteed to
be not accessible by any syscall (e.g. coredump code) or is not a regular
file.  So the only question is whether we care for syscalls that are
out of scope for POSIX.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:39         ` Al Viro
@ 2014-03-03 23:54           ` Linus Torvalds
  2014-03-03 23:54           ` Al Viro
  2014-03-04 20:11           ` Cedric Blancher
  2 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 23:54 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 3:39 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> do_sendfile() is also there and this one is even more unpleasant ;-/
> We probably can ignore that one (until POSIX learns of its existence),
> thouhg...

Yeah, I saw the do_sendfile one and decided we don't care. Not only is
is out of POSIX spec (so if you break it you get to keep both pieces),
the whole sendfile() thing is a bit of a hack.

But we could take the f_pos_lock explicitly in do_sendfile for the
!ppos case if we decide we care. The error handling is the main
annoyance.

           Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:34           ` Linus Torvalds
@ 2014-03-03 23:42             ` Al Viro
  2014-03-03 23:59               ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-03 23:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 03:34:43PM -0800, Linus Torvalds wrote:
> On Mon, Mar 3, 2014 at 3:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...   I would be very surprised if that worked well.  You have just
> > forced fdget() to comply to the ABI.  And unless that has such structs
> > returned in register pairs, there's no way for compiler to do about
> > that.
> 
> Register pairs are very much a common return model.
> 
> And we've relied on that before. For example, 64-bit pte's on x86-32
> very much does that whole thing with the "pte_t" union.
> 
> So you can now commence being surprised.
> 
> (Side note: I think sparc or something doesn't support it, and may
> return things in memory. I can't really seem to find it in myself to
> care)

sparc64 actually does support that.  So does amd64, and, with explicit
flag, i386.  No other more or less general purpose architecture does.

Not ppc.  Not mips.  Not arm.  I think that some of those are worth
caring about...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:23       ` Linus Torvalds
@ 2014-03-03 23:39         ` Al Viro
  2014-03-03 23:54           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Al Viro @ 2014-03-03 23:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 03:23:55PM -0800, Linus Torvalds wrote:

> This just uses a "flags" field, and we currently only have two bits
> that we use: FDPUT_FPUT and FDPUT_POS_UNLOCK. The first patch knows
> that "fget_light()" writes 0/1 for that, which is the same as the
> FDPUT_FPUT bit. I didn't bother to comment on it or clean it up, since
> the second patch just removes that whole fget_light() mess.
> 
> Comments?

do_sendfile() is also there and this one is even more unpleasant ;-/
We probably can ignore that one (until POSIX learns of its existence),
thouhg...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 23:28         ` Al Viro
@ 2014-03-03 23:34           ` Linus Torvalds
  2014-03-03 23:42             ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 23:34 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 3:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...   I would be very surprised if that worked well.  You have just
> forced fdget() to comply to the ABI.  And unless that has such structs
> returned in register pairs, there's no way for compiler to do about
> that.

Register pairs are very much a common return model.

And we've relied on that before. For example, 64-bit pte's on x86-32
very much does that whole thing with the "pte_t" union.

So you can now commence being surprised.

(Side note: I think sparc or something doesn't support it, and may
return things in memory. I can't really seem to find it in myself to
care)

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 22:17       ` Linus Torvalds
@ 2014-03-03 23:28         ` Al Viro
  2014-03-03 23:34           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-03 23:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 02:17:55PM -0800, Linus Torvalds wrote:

> Something like the attached untested patch. This gets rid of
> "fget_light()", and instead makes "fdget()" the native interface (same
> for the "raw" version).

Umm...   I would be very surprised if that worked well.  You have just
forced fdget() to comply to the ABI.  And unless that has such structs
returned in register pairs, there's no way for compiler to do about
that.

The current variant lets the optimizer see what's going on and decide
that since it doesn't need struct fd local variable kept as a single
object, so it can turn it into a pair of locals, independently assigned
(in the inlined fdget() body) and independently used.  If those can
be put into registers, it happens separately for both, etc.

Try your variant on e.g. ppc or alpha.  Both will pass that struct on
stack.  So will mips.  So will s390.  So will 32bit sparc, not that there
had been much use of sparc32 kernels...

It will work on amd64, sparc64.  On i386 we explicitly pass it
-freg-struct-return, so it passes in a pair of registers there as well.

We could start usinmg -freg-struct-return on other architectures, but
I'm not sure if it would actually work - e.g. on ppc it still returns
such a struct on stack.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 22:55     ` Linus Torvalds
@ 2014-03-03 23:23       ` Linus Torvalds
  2014-03-03 23:39         ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 23:23 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 884 bytes --]

On Mon, Mar 3, 2014 at 2:55 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think I'll respin this with the compat readv/writev case fixed and
> with the bitfield replaced with an "unsigned int" that we just do
> bitops by hand on. Because that code generation makes me feel slightly
> ill.

Ok, how about this version? For some reason I thought the compat
functions for readv/writev were in fs/compat.c, but they are there in
the same fs/read_write.c, so I didn't need to move the helper
functions away.

This just uses a "flags" field, and we currently only have two bits
that we use: FDPUT_FPUT and FDPUT_POS_UNLOCK. The first patch knows
that "fget_light()" writes 0/1 for that, which is the same as the
FDPUT_FPUT bit. I didn't bother to comment on it or clean it up, since
the second patch just removes that whole fget_light() mess.

Comments?

           Linus

[-- Attachment #2: 0001-vfs-atomic-f_pos-accesses-as-per-POSIX.patch --]
[-- Type: text/x-patch, Size: 8681 bytes --]

From 9ed5720e8a3139cc703ae19aef5fd3a72658d132 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 3 Mar 2014 09:36:58 -0800
Subject: [PATCH 1/2] vfs: atomic f_pos accesses as per POSIX

Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

 "2.9.7 Thread Interactions with Regular File Operations

  All of the following functions shall be atomic with respect to each
  other in the effects specified in POSIX.1-2008 when they operate on
  regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/file_table.c      |  1 +
 fs/namei.c           |  2 +-
 fs/open.c            |  4 ++++
 fs/read_write.c      | 54 ++++++++++++++++++++++++++++++++++++++--------------
 include/linux/file.h |  6 ++++--
 include/linux/fs.h   |  6 +++++-
 6 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 5fff9030be34..5b24008ea4f6 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -135,6 +135,7 @@ struct file *get_empty_filp(void)
 	atomic_long_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
 	spin_lock_init(&f->f_lock);
+	mutex_init(&f->f_pos_lock);
 	eventpoll_init_file(f);
 	/* f->f_version: 0 */
 	return f;
diff --git a/fs/namei.c b/fs/namei.c
index 385f7817bfcc..2f730ef9b4b3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1884,7 +1884,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 
 		nd->path = f.file->f_path;
 		if (flags & LOOKUP_RCU) {
-			if (f.need_put)
+			if (f.flags & FDPUT_FPUT)
 				*fp = f.file;
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			rcu_read_lock();
diff --git a/fs/open.c b/fs/open.c
index 4b3e1edf2fe4..b9ed8b25c108 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -705,6 +705,10 @@ static int do_dentry_open(struct file *f,
 		return 0;
 	}
 
+	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
+	if (S_ISREG(inode->i_mode))
+		f->f_mode |= FMODE_ATOMIC_POS;
+
 	f->f_op = fops_get(inode->i_fop);
 	if (unlikely(WARN_ON(!f->f_op))) {
 		error = -ENODEV;
diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746a902a..e5b43dd53b94 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -264,10 +264,36 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
 }
 EXPORT_SYMBOL(vfs_llseek);
 
+/*
+ * We only lock f_pos if we have threads or if the file might be
+ * shared with another process. In both cases we'll have an elevated
+ * file count (done either by fdget() or by fork()).
+ */
+static struct fd fdget_pos(int fd)
+{
+	struct fd f = fdget(fd);
+	struct file *file = f.file;
+
+	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
+		if (file_count(file) > 1) {
+			f.flags |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
+	}
+	return f;
+}
+
+static void fdput_pos(struct fd f)
+{
+	if (f.flags & FDPUT_POS_UNLOCK)
+		mutex_unlock(&f.file->f_pos_lock);
+	fdput(f);
+}
+
 SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 {
 	off_t retval;
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	if (!f.file)
 		return -EBADF;
 
@@ -278,7 +304,7 @@ SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 		if (res != (loff_t)retval)
 			retval = -EOVERFLOW;	/* LFS: should only happen on 32 bit platforms */
 	}
-	fdput(f);
+	fdput_pos(f);
 	return retval;
 }
 
@@ -498,7 +524,7 @@ static inline void file_pos_write(struct file *file, loff_t pos)
 
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -506,7 +532,7 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 		ret = vfs_read(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 	return ret;
 }
@@ -514,7 +540,7 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		size_t, count)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -522,7 +548,7 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		ret = vfs_write(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	return ret;
@@ -797,7 +823,7 @@ EXPORT_SYMBOL(vfs_writev);
 SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -805,7 +831,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		ret = vfs_readv(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	if (ret > 0)
@@ -817,7 +843,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -825,7 +851,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		ret = vfs_writev(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	if (ret > 0)
@@ -968,7 +994,7 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 		const struct compat_iovec __user *,vec,
 		compat_ulong_t, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret;
 	loff_t pos;
 
@@ -978,7 +1004,7 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 	ret = compat_readv(f.file, vec, vlen, &pos);
 	if (ret >= 0)
 		f.file->f_pos = pos;
-	fdput(f);
+	fdput_pos(f);
 	return ret;
 }
 
@@ -1035,7 +1061,7 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 		const struct compat_iovec __user *, vec,
 		compat_ulong_t, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret;
 	loff_t pos;
 
@@ -1045,7 +1071,7 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 	ret = compat_writev(f.file, vec, vlen, &pos);
 	if (ret >= 0)
 		f.file->f_pos = pos;
-	fdput(f);
+	fdput_pos(f);
 	return ret;
 }
 
diff --git a/include/linux/file.h b/include/linux/file.h
index cbacf4faf447..f2517fa2d610 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -28,12 +28,14 @@ static inline void fput_light(struct file *file, int fput_needed)
 
 struct fd {
 	struct file *file;
-	int need_put;
+	unsigned int flags;
 };
+#define FDPUT_FPUT       1
+#define FDPUT_POS_UNLOCK 2
 
 static inline void fdput(struct fd fd)
 {
-	if (fd.need_put)
+	if (fd.flags & FDPUT_FPUT)
 		fput(fd.file);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 60829565e552..ebfde04bca06 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -123,6 +123,9 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is opened with O_PATH; almost nothing can be done with it */
 #define FMODE_PATH		((__force fmode_t)0x4000)
 
+/* File needs atomic accesses to f_pos */
+#define FMODE_ATOMIC_POS	((__force fmode_t)0x8000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
@@ -780,13 +783,14 @@ struct file {
 	const struct file_operations	*f_op;
 
 	/*
-	 * Protects f_ep_links, f_flags, f_pos vs i_size in lseek SEEK_CUR.
+	 * Protects f_ep_links, f_flags.
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
+	struct mutex		f_pos_lock;
 	loff_t			f_pos;
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
-- 
1.9.0


[-- Attachment #3: 0002-vfs-get-rid-of-old-legacy-f-get-put-_-raw_-_light-in.patch --]
[-- Type: text/x-patch, Size: 4211 bytes --]

From f55e00e40e6aba3e1e8fc701a7069a0b7a0868b7 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 3 Mar 2014 14:43:03 -0800
Subject: [PATCH 2/2] vfs: get rid of old legacy "f{get,put}_[raw_]_light()"
 interfaces

Everything but the net/socket.c use has long since been converted to the
fdget() model that has a cleaner calling convention, so just get rid of
the old interface and teach the socket code to use the new model using a
trivial shim layer.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/file.c            | 30 +++++++++++++++---------------
 include/linux/file.h | 24 ++----------------------
 net/socket.c         | 17 ++++++++++++-----
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index db25c2bdfe46..baaa755f3919 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -683,33 +683,33 @@ EXPORT_SYMBOL(fget_raw);
  * The fput_needed flag returned by fget_light should be passed to the
  * corresponding fput_light.
  */
-struct file *__fget_light(unsigned int fd, fmode_t mask, int *fput_needed)
+static inline struct fd __fdget(unsigned int fd, fmode_t mask)
 {
 	struct files_struct *files = current->files;
-	struct file *file;
+	struct fd f = { NULL, };
 
-	*fput_needed = 0;
 	if (atomic_read(&files->count) == 1) {
-		file = __fcheck_files(files, fd);
-		if (file && (file->f_mode & mask))
-			file = NULL;
+		f.file = __fcheck_files(files, fd);
+		if (f.file && (f.file->f_mode & mask))
+			f.file = NULL;
 	} else {
-		file = __fget(fd, mask);
-		if (file)
-			*fput_needed = 1;
+		f.file = __fget(fd, mask);
+		if (f.file)
+			f.flags = FDPUT_FPUT;
 	}
 
-	return file;
+	return f;
 }
-struct file *fget_light(unsigned int fd, int *fput_needed)
+
+struct fd fdget(unsigned int fd)
 {
-	return __fget_light(fd, FMODE_PATH, fput_needed);
+	return __fdget(fd, FMODE_PATH);
 }
-EXPORT_SYMBOL(fget_light);
+EXPORT_SYMBOL(fdget);
 
-struct file *fget_raw_light(unsigned int fd, int *fput_needed)
+struct fd fdget_raw(unsigned int fd)
 {
-	return __fget_light(fd, 0, fput_needed);
+	return __fdget(fd, 0);
 }
 
 void set_close_on_exec(unsigned int fd, int flag)
diff --git a/include/linux/file.h b/include/linux/file.h
index f2517fa2d610..1f6b39394a9e 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -20,12 +20,6 @@ struct path;
 extern struct file *alloc_file(struct path *, fmode_t mode,
 	const struct file_operations *fop);
 
-static inline void fput_light(struct file *file, int fput_needed)
-{
-	if (fput_needed)
-		fput(file);
-}
-
 struct fd {
 	struct file *file;
 	unsigned int flags;
@@ -40,24 +34,10 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
-extern struct file *fget_light(unsigned int fd, int *fput_needed);
-
-static inline struct fd fdget(unsigned int fd)
-{
-	int b;
-	struct file *f = fget_light(fd, &b);
-	return (struct fd){f,b};
-}
-
 extern struct file *fget_raw(unsigned int fd);
-extern struct file *fget_raw_light(unsigned int fd, int *fput_needed);
 
-static inline struct fd fdget_raw(unsigned int fd)
-{
-	int b;
-	struct file *f = fget_raw_light(fd, &b);
-	return (struct fd){f,b};
-}
+extern struct fd fdget(unsigned int fd);
+extern struct fd fdget_raw(unsigned int fd);
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
diff --git a/net/socket.c b/net/socket.c
index 879933aaed4c..5ef37af499b3 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -448,18 +448,25 @@ struct socket *sockfd_lookup(int fd, int *err)
 }
 EXPORT_SYMBOL(sockfd_lookup);
 
+static void fput_light(struct file *file, int flags)
+{
+	struct fd f = { file, flags };
+	fdput(f);
+}
+
 static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
 {
-	struct file *file;
+	struct fd f;
 	struct socket *sock;
 
 	*err = -EBADF;
-	file = fget_light(fd, fput_needed);
-	if (file) {
-		sock = sock_from_file(file, err);
+	f = fdget(fd);
+	*fput_needed = f.flags;
+	if (f.file) {
+		sock = sock_from_file(f.file, err);
 		if (sock)
 			return sock;
-		fput_light(file, *fput_needed);
+		fdput(f);
 	}
 	return NULL;
 }
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:52   ` Linus Torvalds
  2014-03-03 22:01     ` Al Viro
@ 2014-03-03 22:55     ` Linus Torvalds
  2014-03-03 23:23       ` Linus Torvalds
  2014-03-04  0:07     ` George Spelvin
  2 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 22:55 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 1:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>>
>> Most of the cases have it kept separately in registers, actually - there's
>> a reason why fdget() and friends are inlined.
>
> Yes. And bit test and set ops on registers are actually cheaper than
> playing around with bytes.

Ugh. gcc gets this *horribly* wrong when it's a bitfield in a union.

The

    f.need_pos_unlock = 1;

thing *should* be just a simple "orl $2,reg", but bitfields seem to
generate really crappy code, and it actually does some insane shifting
and masking crud with the constant "1".

gcc has had problems with bitfields before.

I think I'll respin this with the compat readv/writev case fixed and
with the bitfield replaced with an "unsigned int" that we just do
bitops by hand on. Because that code generation makes me feel slightly
ill.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 22:10         ` Al Viro
@ 2014-03-03 22:22           ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 22:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 3, 2014 at 2:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> It can - compat read/write/read/writev are also there these days.

There is no "compat_read/write()" - or I would have noticed it when
grepping. Regular read/write is fine for the compat case.

It's only readv/writev that have compat versions, because the vectors
need to be converted,

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 22:09         ` Al Viro
@ 2014-03-03 22:20           ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 22:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 3, 2014 at 2:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Yes, but in that case fdget() has grabbed a reference to that sucker,
> so the only way to end with refcount 1 is to have the damn thing gone
> from descriptor table in between.  And AFAICS in that case we are just
> fine without f_pos_lock.

Ahh. Yes, you're right, just checking the file_count() is sufficient,
because as you say, the threaded case will have incremented it for the
case we care about./

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 22:01     ` Al Viro
@ 2014-03-03 22:17       ` Linus Torvalds
  2014-03-03 23:28         ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 22:17 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 680 bytes --]

On Mon, Mar 3, 2014 at 2:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> The thing is, the callers in there do *not* keep struct file * at all -
> they keep struct socket * and use sock->file to get struct file * back
> when they need it.

Not a problem. Just make sockfd_lookup_light() use the broken
inefficient calling convention.

So then the networking code can continue to use the old bad interface,
without it impacting the normal users.

Something like the attached untested patch. This gets rid of
"fget_light()", and instead makes "fdget()" the native interface (same
for the "raw" version).

Totally untested patch (on top of my previous one)

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 3673 bytes --]

 fs/file.c            | 30 +++++++++++++++---------------
 include/linux/file.h | 24 ++----------------------
 net/socket.c         | 17 ++++++++++++-----
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index db25c2bdfe46..cd0c1c329baf 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -683,33 +683,33 @@ EXPORT_SYMBOL(fget_raw);
  * The fput_needed flag returned by fget_light should be passed to the
  * corresponding fput_light.
  */
-struct file *__fget_light(unsigned int fd, fmode_t mask, int *fput_needed)
+static struct fd __fdget(unsigned int fd, fmode_t mask)
 {
 	struct files_struct *files = current->files;
-	struct file *file;
+	struct fd f = { NULL, };
 
-	*fput_needed = 0;
 	if (atomic_read(&files->count) == 1) {
-		file = __fcheck_files(files, fd);
-		if (file && (file->f_mode & mask))
-			file = NULL;
+		f.file = __fcheck_files(files, fd);
+		if (f.file && (f.file->f_mode & mask))
+			f.file = NULL;
 	} else {
-		file = __fget(fd, mask);
-		if (file)
-			*fput_needed = 1;
+		f.file = __fget(fd, mask);
+		if (f.file)
+			f.need_put = 1;
 	}
 
-	return file;
+	return f;
 }
-struct file *fget_light(unsigned int fd, int *fput_needed)
+
+struct fd fdget(unsigned int fd)
 {
-	return __fget_light(fd, FMODE_PATH, fput_needed);
+	return __fdget(fd, FMODE_PATH);
 }
-EXPORT_SYMBOL(fget_light);
+EXPORT_SYMBOL(fdget);
 
-struct file *fget_raw_light(unsigned int fd, int *fput_needed)
+struct fd fdget_raw(unsigned int fd)
 {
-	return __fget_light(fd, 0, fput_needed);
+	return __fdget(fd, 0);
 }
 
 void set_close_on_exec(unsigned int fd, int flag)
diff --git a/include/linux/file.h b/include/linux/file.h
index 636634a09c92..253402e6f3da 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -20,12 +20,6 @@ struct path;
 extern struct file *alloc_file(struct path *, fmode_t mode,
 	const struct file_operations *fop);
 
-static inline void fput_light(struct file *file, int fput_needed)
-{
-	if (fput_needed)
-		fput(file);
-}
-
 struct fd {
 	struct file *file;
 	unsigned need_put:1, need_pos_unlock:1;
@@ -38,24 +32,10 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
-extern struct file *fget_light(unsigned int fd, int *fput_needed);
-
-static inline struct fd fdget(unsigned int fd)
-{
-	int b;
-	struct file *f = fget_light(fd, &b);
-	return (struct fd){f,b,0};
-}
-
 extern struct file *fget_raw(unsigned int fd);
-extern struct file *fget_raw_light(unsigned int fd, int *fput_needed);
 
-static inline struct fd fdget_raw(unsigned int fd)
-{
-	int b;
-	struct file *f = fget_raw_light(fd, &b);
-	return (struct fd){f,b,0};
-}
+extern struct fd fdget(unsigned int fd);
+extern struct fd fdget_raw(unsigned int fd);
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
diff --git a/net/socket.c b/net/socket.c
index 879933aaed4c..4c9dc05f3b4b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -448,18 +448,25 @@ struct socket *sockfd_lookup(int fd, int *err)
 }
 EXPORT_SYMBOL(sockfd_lookup);
 
+static void fput_light(struct file *file, int fput_needed)
+{
+	struct fd f = { file, fput_needed, 0};
+	fdput(f);
+}
+
 static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
 {
-	struct file *file;
+	struct fd f;
 	struct socket *sock;
 
 	*err = -EBADF;
-	file = fget_light(fd, fput_needed);
-	if (file) {
-		sock = sock_from_file(file, err);
+	f = fdget(fd);
+	*fput_needed = f.need_put;
+	if (f.file) {
+		sock = sock_from_file(f.file, err);
 		if (sock)
 			return sock;
-		fput_light(file, *fput_needed);
+		fdput(f);
 	}
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 22:01       ` Linus Torvalds
@ 2014-03-03 22:10         ` Al Viro
  2014-03-03 22:22           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-03 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 03, 2014 at 02:01:33PM -0800, Linus Torvalds wrote:
> On Mon, Mar 3, 2014 at 1:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > And it looks like you've missed compat counterparts.
> 
> Grr. Yes, I missed them for compat_readv/writev - even though I
> grepped for various users, but I grepped for all the *other* cases (ie
> vfs_{read|write|llseek}). That means that the locker helper can't be
> static to fs/read_write.c. Oh well.

It can - compat read/write/read/writev are also there these days.  I think
that's the only thing you've missed, but I've several more places to trace
before I'll be sure.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:56       ` Linus Torvalds
@ 2014-03-03 22:09         ` Al Viro
  2014-03-03 22:20           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-03 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 03, 2014 at 01:56:31PM -0800, Linus Torvalds wrote:
> On Mon, Mar 3, 2014 at 1:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Um...  That's odd - we *could* get there with f.need_put and
> > file_count(file) equal to 1, but why would we want to take
> > f_pos_lock in that case?
> 
> Because that means that the file table is shared among threads. So
> another thread can access the struct file pointer and do a concurrent
> read() or write() on it, and so we need to lock f_pos.

Yes, but in that case fdget() has grabbed a reference to that sucker,
so the only way to end with refcount 1 is to have the damn thing gone
from descriptor table in between.  And AFAICS in that case we are just
fine without f_pos_lock.

>  - concurrent access due to duplicated "struct file_table" pointers.
> This is the "need_put" test, since __fget_light() will have tested the
> proper files->count already.

Shared descriptor table means that we'd better have a reference grabbed
already...

I agree that file_count(file) > 1 for any reason requires locking it;
it's just that need_put means exactly that we have grabbed a reference
ourselves.  So having need_put && file_count <= 1 means that we have
grabbed it and then somebody dropped all other references.  Including
the ones in descriptor table(s), shared or not, etc.  In that case they'd
better not touch that struct file anymore - after all, once we are done
with whatever we are doing, we'll do fput() and it will be ripped from
under them with no warning.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:45     ` Al Viro
  2014-03-03 21:56       ` Linus Torvalds
@ 2014-03-03 22:01       ` Linus Torvalds
  2014-03-03 22:10         ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 22:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 3, 2014 at 1:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> And it looks like you've missed compat counterparts.

Grr. Yes, I missed them for compat_readv/writev - even though I
grepped for various users, but I grepped for all the *other* cases (ie
vfs_{read|write|llseek}). That means that the locker helper can't be
static to fs/read_write.c. Oh well.

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:52   ` Linus Torvalds
@ 2014-03-03 22:01     ` Al Viro
  2014-03-03 22:17       ` Linus Torvalds
  2014-03-03 22:55     ` Linus Torvalds
  2014-03-04  0:07     ` George Spelvin
  2 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-03 22:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 03, 2014 at 01:52:13PM -0800, Linus Torvalds wrote:
> On Mon, Mar 3, 2014 at 1:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Mon, Mar 03, 2014 at 04:03:59PM -0500, George Spelvin wrote:
> >>
> >> (If you want to use bits, why not use the two lsbits of the file pointer
> >> for the purpose?  That would save a lot of space.)
> >
> > Most of the cases have it kept separately in registers, actually - there's
> > a reason why fdget() and friends are inlined.
> 
> Yes. And bit test and set ops on registers are actually cheaper than
> playing around with bytes.
> 
> That said, the "fget_light()" interface sucks - exactly because it
> doesn't do the "return structure in two registers" thing. We should
> get rid of it - there's just one remaining user in networking code,
> and it should be rewritten in terms of fdget().
> 
> That's a separate issue, though.

The thing is, the callers in there do *not* keep struct file * at all -
they keep struct socket * and use sock->file to get struct file * back
when they need it.

So struct fd is the wrong thing to use there - it only adds to register
pressure.  A similar pair of struct socket * and "need to fput that"
flag would probably be needed, with sockfd_lookup_light() returning
that.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:45     ` Al Viro
@ 2014-03-03 21:56       ` Linus Torvalds
  2014-03-03 22:09         ` Al Viro
  2014-03-03 22:01       ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 21:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 3, 2014 at 1:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Um...  That's odd - we *could* get there with f.need_put and
> file_count(file) equal to 1, but why would we want to take
> f_pos_lock in that case?

Because that means that the file table is shared among threads. So
another thread can access the struct file pointer and do a concurrent
read() or write() on it, and so we need to lock f_pos.

Basically, there are two cases:

 - duplicated file pointers due to fork(). That's the "file_count()" test.

   Yes, this will trigger even if they didn't fork, just dup'ed the
file descriptor. We have no way of telling the difference, though.

 - concurrent access due to duplicated "struct file_table" pointers.
This is the "need_put" test, since __fget_light() will have tested the
proper files->count already.

Both need f_pos_lock.

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:26 ` Al Viro
@ 2014-03-03 21:52   ` Linus Torvalds
  2014-03-03 22:01     ` Al Viro
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 21:52 UTC (permalink / raw)
  To: Al Viro; +Cc: George Spelvin, linux-fsdevel, Linux Kernel Mailing List

On Mon, Mar 3, 2014 at 1:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Mar 03, 2014 at 04:03:59PM -0500, George Spelvin wrote:
>>
>> (If you want to use bits, why not use the two lsbits of the file pointer
>> for the purpose?  That would save a lot of space.)
>
> Most of the cases have it kept separately in registers, actually - there's
> a reason why fdget() and friends are inlined.

Yes. And bit test and set ops on registers are actually cheaper than
playing around with bytes.

That said, the "fget_light()" interface sucks - exactly because it
doesn't do the "return structure in two registers" thing. We should
get rid of it - there's just one remaining user in networking code,
and it should be rewritten in terms of fdget().

That's a separate issue, though.

Oh, and George, your email setup is broken. Gmail thinks your emails
are spam. I'm not sure why (the usual spf problems do not apply), but
it's possibly because your name and email looks made up.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 17:36   ` Linus Torvalds
@ 2014-03-03 21:45     ` Al Viro
  2014-03-03 21:56       ` Linus Torvalds
  2014-03-03 22:01       ` Linus Torvalds
  2014-03-06 15:03     ` Michael Kerrisk (man-pages)
  1 sibling, 2 replies; 43+ messages in thread
From: Al Viro @ 2014-03-03 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Kerrisk (man-pages),
	lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, J. Bruce Fields,
	Yongzhi Pan

On Mon, Mar 03, 2014 at 09:36:58AM -0800, Linus Torvalds wrote:
> +static struct fd fdget_pos(int fd)
> +{
> +	struct fd f = fdget(fd);
> +	struct file *file = f.file;
> +
> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> +		if (f.need_put || file_count(file) > 1) {

Um...  That's odd - we *could* get there with f.need_put and
file_count(file) equal to 1, but why would we want to take
f_pos_lock in that case?

And it looks like you've missed compat counterparts.  Other than
that, it'd probably work...  Let me check if there are other
paths we might need to take care of, though.

PS: apologies for being pretty much MIA for the last two weeks; I'd been
ears-deep in interesting scalability crap around vfsmount handling ;-/
Will post the resulting patches for review shortly...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-03-03 21:03 George Spelvin
@ 2014-03-03 21:26 ` Al Viro
  2014-03-03 21:52   ` Linus Torvalds
  2014-05-04  7:04 ` Michael Kerrisk
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2014-03-03 21:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: torvalds, linux-fsdevel, linux-kernel

On Mon, Mar 03, 2014 at 04:03:59PM -0500, George Spelvin wrote:
> >  struct fd {
> >  	struct file *file;
> > -	int need_put;
> > +	unsigned need_put:1, need_pos_unlock:1;
> >  };
> 
> Since we're rounding up to 2*sizeof(struct file *) anyway, is this a case
> where wasting space on a couple of char (or bool) flags would generate
> better code than a bitfield?
> 
> In particular, the code to set need_pos_unlock (which will be executed
> each read/write for most files) is messy in the bitfield case.
> A byte store is much cleaner.
> 
> (If you want to use bits, why not use the two lsbits of the file pointer
> for the purpose?  That would save a lot of space.)

Most of the cases have it kept separately in registers, actually - there's
a reason why fdget() and friends are inlined.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
@ 2014-03-03 21:03 George Spelvin
  2014-03-03 21:26 ` Al Viro
  2014-05-04  7:04 ` Michael Kerrisk
  0 siblings, 2 replies; 43+ messages in thread
From: George Spelvin @ 2014-03-03 21:03 UTC (permalink / raw)
  To: torvalds; +Cc: linux, linux-fsdevel, linux-kernel, viro

>  struct fd {
>  	struct file *file;
> -	int need_put;
> +	unsigned need_put:1, need_pos_unlock:1;
>  };

Since we're rounding up to 2*sizeof(struct file *) anyway, is this a case
where wasting space on a couple of char (or bool) flags would generate
better code than a bitfield?

In particular, the code to set need_pos_unlock (which will be executed
each read/write for most files) is messy in the bitfield case.
A byte store is much cleaner.

(If you want to use bits, why not use the two lsbits of the file pointer
for the purpose?  That would save a lot of space.)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-20 17:14 ` Linus Torvalds
@ 2014-03-03 17:36   ` Linus Torvalds
  2014-03-03 21:45     ` Al Viro
  2014-03-06 15:03     ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2014-03-03 17:36 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, Al Viro,
	J. Bruce Fields, Yongzhi Pan

[-- Attachment #1: Type: text/plain, Size: 1685 bytes --]

Ok, sorry for the long delay, I was distracted (and hoping that Al
would come up with a patch).

Anyway, attached is the patch I think we should do for this issue. It
is fairly simple:

 - it adds a "f_pos_mutex" to the "struct file".

 - it adds a new FMODE_ATOMIC_POS flag to the file mode flags to mark
things that need atomic f_pos updates

 - it makes the "struct fd" flags be two flags rather than one: the
second flag is for "unlock f_pos_mutex when done"

 - it introduces "fd[get,put]_pos()" which gets the f_pos_mutex when required

 - it makes read/write/lseek use that.

It's pretty damn straightforward, I think, and is minimally serializing.

Al, comments? Yongzhi Pan, this is pretty much untested, but it's
pretty simple and it does fix your test-case.

                       Linus

On Thu, Feb 20, 2014 at 9:14 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Yes, I do think we violate POSIX here because of how we handle f_pos
> (the earlier thread from 2006 you point to talks about the "thread
> safe" part, the point here about the actual wording of "atomic" is a
> separate issue).
>
> Long long ago we used to just pass in the pointer to f_pos directly,
> and then the low-level write would update it all under the inode
> semaphore, and all was good.
>
> And then we ended up having tons of problems with non-regular files
> and drivers accessing f_pos and having nasty races with it because
> they did *not* have any locking (and very fundamentally didn't want
> any), and we broke the serialization of f_pos. We still do the *IO*
> atomically, but yes, the f_pos access itself is outside the lock.
>
> Ho humm.. Al, any ideas of how to fix this?

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 6243 bytes --]

 fs/file_table.c      |  1 +
 fs/open.c            |  4 ++++
 fs/read_write.c      | 46 ++++++++++++++++++++++++++++++++++++----------
 include/linux/file.h |  6 +++---
 include/linux/fs.h   |  6 +++++-
 5 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 5fff9030be34..5b24008ea4f6 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -135,6 +135,7 @@ struct file *get_empty_filp(void)
 	atomic_long_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
 	spin_lock_init(&f->f_lock);
+	mutex_init(&f->f_pos_lock);
 	eventpoll_init_file(f);
 	/* f->f_version: 0 */
 	return f;
diff --git a/fs/open.c b/fs/open.c
index 4b3e1edf2fe4..b9ed8b25c108 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -705,6 +705,10 @@ static int do_dentry_open(struct file *f,
 		return 0;
 	}
 
+	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
+	if (S_ISREG(inode->i_mode))
+		f->f_mode |= FMODE_ATOMIC_POS;
+
 	f->f_op = fops_get(inode->i_fop);
 	if (unlikely(WARN_ON(!f->f_op))) {
 		error = -ENODEV;
diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746a902a..9273ec92f07e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -264,10 +264,36 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
 }
 EXPORT_SYMBOL(vfs_llseek);
 
+/*
+ * We only lock f_pos if we have threads (in which case "need_put"
+ * will be set) or if the file might be shared with another process
+ * (in which case the file count is elevated).
+ */
+static struct fd fdget_pos(int fd)
+{
+	struct fd f = fdget(fd);
+	struct file *file = f.file;
+
+	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
+		if (f.need_put || file_count(file) > 1) {
+			f.need_pos_unlock = 1;
+			mutex_lock(&file->f_pos_lock);
+		}
+	}
+	return f;
+}
+
+static void fdput_pos(struct fd f)
+{
+	if (f.need_pos_unlock)
+		mutex_unlock(&f.file->f_pos_lock);
+	fdput(f);
+}
+
 SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 {
 	off_t retval;
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	if (!f.file)
 		return -EBADF;
 
@@ -278,7 +304,7 @@ SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 		if (res != (loff_t)retval)
 			retval = -EOVERFLOW;	/* LFS: should only happen on 32 bit platforms */
 	}
-	fdput(f);
+	fdput_pos(f);
 	return retval;
 }
 
@@ -498,7 +524,7 @@ static inline void file_pos_write(struct file *file, loff_t pos)
 
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -506,7 +532,7 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 		ret = vfs_read(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 	return ret;
 }
@@ -514,7 +540,7 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		size_t, count)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -522,7 +548,7 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		ret = vfs_write(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	return ret;
@@ -797,7 +823,7 @@ EXPORT_SYMBOL(vfs_writev);
 SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -805,7 +831,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		ret = vfs_readv(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	if (ret > 0)
@@ -817,7 +843,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
@@ -825,7 +851,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		ret = vfs_writev(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
-		fdput(f);
+		fdput_pos(f);
 	}
 
 	if (ret > 0)
diff --git a/include/linux/file.h b/include/linux/file.h
index cbacf4faf447..636634a09c92 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -28,7 +28,7 @@ static inline void fput_light(struct file *file, int fput_needed)
 
 struct fd {
 	struct file *file;
-	int need_put;
+	unsigned need_put:1, need_pos_unlock:1;
 };
 
 static inline void fdput(struct fd fd)
@@ -44,7 +44,7 @@ static inline struct fd fdget(unsigned int fd)
 {
 	int b;
 	struct file *f = fget_light(fd, &b);
-	return (struct fd){f,b};
+	return (struct fd){f,b,0};
 }
 
 extern struct file *fget_raw(unsigned int fd);
@@ -54,7 +54,7 @@ static inline struct fd fdget_raw(unsigned int fd)
 {
 	int b;
 	struct file *f = fget_raw_light(fd, &b);
-	return (struct fd){f,b};
+	return (struct fd){f,b,0};
 }
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 60829565e552..ebfde04bca06 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -123,6 +123,9 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is opened with O_PATH; almost nothing can be done with it */
 #define FMODE_PATH		((__force fmode_t)0x4000)
 
+/* File needs atomic accesses to f_pos */
+#define FMODE_ATOMIC_POS	((__force fmode_t)0x8000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
@@ -780,13 +783,14 @@ struct file {
 	const struct file_operations	*f_op;
 
 	/*
-	 * Protects f_ep_links, f_flags, f_pos vs i_size in lseek SEEK_CUR.
+	 * Protects f_ep_links, f_flags.
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
+	struct mutex		f_pos_lock;
 	loff_t			f_pos;
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-17 15:41 Michael Kerrisk (man-pages)
  2014-02-18 13:00 ` Michael Kerrisk
@ 2014-02-20 17:14 ` Linus Torvalds
  2014-03-03 17:36   ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-02-20 17:14 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: lkml, Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, Al Viro,
	J. Bruce Fields, Yongzhi Pan

Yes, I do think we violate POSIX here because of how we handle f_pos
(the earlier thread from 2006 you point to talks about the "thread
safe" part, the point here about the actual wording of "atomic" is a
separate issue).

Long long ago we used to just pass in the pointer to f_pos directly,
and then the low-level write would update it all under the inode
semaphore, and all was good.

And then we ended up having tons of problems with non-regular files
and drivers accessing f_pos and having nasty races with it because
they did *not* have any locking (and very fundamentally didn't want
any), and we broke the serialization of f_pos. We still do the *IO*
atomically, but yes, the f_pos access itself is outside the lock.

Ho humm.. Al, any ideas of how to fix this?

                     Linus



On Mon, Feb 17, 2014 at 7:41 AM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hello all,
>
> A note from Yongzhi Pan about some of my own code led me to dig deeper
> and discover behavior that is surprising and also seems to be a
> fairly clear violation of POSIX requirements.
>
> It appears that write() (and, presumably read() and other similar
> system calls) are not atomic with respect to performing I/O and
> updating the file offset behavior.
>
> The problem can be demonstrated using the program below.
> That program takes three arguments:
>
> $ ./multi_writer num-children num-blocks block-size > somefile
>
> It creates 'num-children' children, each of which writes 'num-blocks'
> blocks of 'block-size' bytes to standard output; for my experiments,
> stdout is redirected to a file. After all children have finished,
> the parent inspects the size of the file written on stdout, calculates
> the expected size of the file, and displays these two values, and
> their difference on stderr.
>
> Some observations:
>
> * All children inherit the stdout file descriptor from the parent;
>   thus the FDs refer to the same open file description, and therefore
>   share the file offset.
>
> * When I run this on a multi-CPU BSD systems, I get the expected result:
>
> $ ./multi_writer 10 10000 1000 > g 2> run.log
> $ ls -l g
> -rw-------  1 mkerrisk  users  100000000 Jan 17 07:34 g
>
> * Someone else tested this code for me on a Solaris system, and also got
>   the expected result.
>
> * On Linux, by contrast, we see behavior such as the following:
>
> $ ./multi_writer 10 10000 1000 > g
> Expected file size:  100000000
> Actual file size:     16323000
> Difference:           83677000
> $ ls -l g
> -rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g
>
> Summary of the above output: some children are overwriting the output
> of other children because output is not atomic with respect to updates
> to the file offset.
>
> For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:
>
> [[
> 2.9.7 Thread Interactions with Regular File Operations
>
> All of the following functions shall be atomic with respect to each other
> in the effects specified in POSIX.1-2008 when they operate on regular
> files or symbolic links:
>
>
> chmod()
> ...
> pread()
> read()
> ...
> readv()
> pwrite()
> ...
> write()
> writev()
>
>
> If two threads each call one of these functions, each call shall either
> see all of the specified effects of the other call, or none of them.
> ]]
>
> (POSIX.1-2001 has similar text.)
>
> This text is in one of the Threads sections, but it applies equally
> to threads in different processes as to threads in the same process.
>
> I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a
> number of other recent kernels, all with similar results, which suggests
> the result is in the VFS layer. (Can it really be so simple as no locking
> around pieces such as
>
>                 loff_t pos = file_pos_read(f.file);
>                 ret = vfs_write(f.file, buf, count, &pos);
>                 if (ret >= 0)
>                         file_pos_write(f.file, pos);
>
> in fs/read_write.c?)
>
> I discovered this behavior after Yongzhi Pan reported some unexpected
> behavior in some of my code that forked to create a parent and
> child that wrote to the same file. In some cases, expected output
> was not appearing. In other words, after a fork(), and in the absence
> of any other synchronization technique, a parent and a child cannot
> safely write to the same file descriptor without risking overwriting
> each other's output. But POSIX requires this, and other systems seem
> to guarantee it.
>
> Am I correct to think there's a kernel problem here?
>
> Thanks,
>
> Michael
>
> ===
>
> /* multi_writer.c
> */
>
> #include <sys/wait.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <errno.h>
>
> typedef enum { FALSE, TRUE } Boolean;
>
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                         } while (0)
>
> #define fatal(msg)      do { fprintf(stderr, "%s\n", msg); \
>                              exit(EXIT_FAILURE); } while (0)
>
> #define usageErr(msg, progName)        \
>                         do { fprintf(stderr, "Usage: "); \
>                              fprintf(stderr, msg, progName); \
>                              exit(EXIT_FAILURE); } while (0)
>
> int
> main(int argc, char *argv[])
> {
>     char *buf;
>     int j, k, nblocks, nchildren;
>     size_t blocksize;
>     struct stat sb;
> //  int nchanges;
> //  off_t pos;
>     long long expected;
>
>     if (argc < 4 || strcmp(argv[1], "--help") == 0)
>         usageErr("%s num-children num-blocks block-size [O_APPEND-flag]\n",
>                 argv[0]);
>
>     nblocks = atoi(argv[2]);
>     blocksize = atoi(argv[3]);
>
>     buf = malloc(blocksize + 1);
>     if (buf == NULL)
>         errExit("malloc");
>
>     /* If a fourth command-line argument is specified, set the O_APPEND
>        flag on stdout */
>
>     if (argc > 4)
>         if (fcntl(STDOUT_FILENO, F_SETFL, O_APPEND) == -1)
>             errExit("fcntl-F_SETFL");
>
>     nchildren = atoi(argv[1]);
>
>     /* Create child processes that write blocks to stdout */
>
>     for (j = 0; j < nchildren; j++) {
>         switch(fork()) {
>         case -1:
>             errExit("fork");
>
>         case 0: /* Each child writes nblocks * blocksize bytes to stdout */
> //          nchanges = 0;
>
>             /* Put something distinctive in each child's buffer (in case
>                we want to analyze byte sequences in the output) */
>
>             for (k = 0; k < blocksize; k++)
>                 buf[k] = 'a' + getpid() % 26;
>
>             for (k = 0; k < nblocks; k++) {
> //              if (k > 0 && pos != lseek(STDOUT_FILENO, 0, SEEK_END))
> //                  nchanges++;
>                 if (write(STDOUT_FILENO, buf, blocksize) != blocksize)
>                     fatal("write");
> //              pos = lseek(STDOUT_FILENO, 0, SEEK_END);
>             }
>
> //          fprintf(stderr, "%ld: nchanges = %d\n",
> //                  (long) getpid(), nchanges);
>             exit(EXIT_SUCCESS);
>
>         default:
>             break;      /* Parent falls through to create next child */
>         }
>     }
>
>     /* Wait for all children to terminate */
>
>     while (wait(NULL) > 0)
>         continue;
>
>     /* Compare final length of file against expected size */
>
>     if (fstat(STDOUT_FILENO, &sb) == -1)
>         errExit("fstat");
>
>     expected =  blocksize * nblocks * nchildren;
>     fprintf(stderr, "Expected file size: %10lld\n", expected);
>     fprintf(stderr, "Actual file size:   %10lld\n", (long long) sb.st_size);
>     fprintf(stderr, "Difference:         %10lld\n", expected - sb.st_size);
>
>     exit(EXIT_SUCCESS);
> }
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Update of file offset on write() etc. is non-atomic with I/O
  2014-02-17 15:41 Michael Kerrisk (man-pages)
@ 2014-02-18 13:00 ` Michael Kerrisk
  2014-02-20 17:14 ` Linus Torvalds
  1 sibling, 0 replies; 43+ messages in thread
From: Michael Kerrisk @ 2014-02-18 13:00 UTC (permalink / raw)
  To: lkml
  Cc: Miklos Szeredi, Theodore T'so, Christoph Hellwig,
	Chris Mason, Dave Chinner, Linux-Fsdevel, Al Viro,
	J. Bruce Fields, Yongzhi Pan, Michael Kerrisk (gmail),
	Andrew Morton, Linus Torvalds, Alan Cox, Kyle Moffett,
	Karaoui mohamed lamine

[expanding the CC list a little more to bring in some previously
interested parties]

On Mon, Feb 17, 2014 at 4:41 PM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hello all,
>
> A note from Yongzhi Pan about some of my own code led me to dig deeper
> and discover behavior that is surprising and also seems to be a
> fairly clear violation of POSIX requirements.
>
> It appears that write() (and, presumably read() and other similar
> system calls) are not atomic with respect to performing I/O and
> updating the file offset behavior.
>
> The problem can be demonstrated using the program below.
> That program takes three arguments:
>
> $ ./multi_writer num-children num-blocks block-size > somefile
>
> It creates 'num-children' children, each of which writes 'num-blocks'
> blocks of 'block-size' bytes to standard output; for my experiments,
> stdout is redirected to a file. After all children have finished,
> the parent inspects the size of the file written on stdout, calculates
> the expected size of the file, and displays these two values, and
> their difference on stderr.
>
> Some observations:
>
> * All children inherit the stdout file descriptor from the parent;
>   thus the FDs refer to the same open file description, and therefore
>   share the file offset.
>
> * When I run this on a multi-CPU BSD systems, I get the expected result:
>
> $ ./multi_writer 10 10000 1000 > g 2> run.log
> $ ls -l g
> -rw-------  1 mkerrisk  users  100000000 Jan 17 07:34 g
>
> * Someone else tested this code for me on a Solaris system, and also got
>   the expected result.
>
> * On Linux, by contrast, we see behavior such as the following:
>
> $ ./multi_writer 10 10000 1000 > g
> Expected file size:  100000000
> Actual file size:     16323000
> Difference:           83677000
> $ ls -l g
> -rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g
>
> Summary of the above output: some children are overwriting the output
> of other children because output is not atomic with respect to updates
> to the file offset.
>
> For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:
>
> [[
> 2.9.7 Thread Interactions with Regular File Operations
>
> All of the following functions shall be atomic with respect to each other
> in the effects specified in POSIX.1-2008 when they operate on regular
> files or symbolic links:
>
>
> chmod()
> ...
> pread()
> read()
> ...
> readv()
> pwrite()
> ...
> write()
> writev()
>
>
> If two threads each call one of these functions, each call shall either
> see all of the specified effects of the other call, or none of them.
> ]]
>
> (POSIX.1-2001 has similar text.)
>
> This text is in one of the Threads sections, but it applies equally
> to threads in different processes as to threads in the same process.
>
> I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a
> number of other recent kernels, all with similar results, which suggests
> the result is in the VFS layer. (Can it really be so simple as no locking
> around pieces such as
>
>                 loff_t pos = file_pos_read(f.file);
>                 ret = vfs_write(f.file, buf, count, &pos);
>                 if (ret >= 0)
>                         file_pos_write(f.file, pos);
>
> in fs/read_write.c?)
>
> I discovered this behavior after Yongzhi Pan reported some unexpected
> behavior in some of my code that forked to create a parent and
> child that wrote to the same file. In some cases, expected output
> was not appearing. In other words, after a fork(), and in the absence
> of any other synchronization technique, a parent and a child cannot
> safely write to the same file descriptor without risking overwriting
> each other's output. But POSIX requires this, and other systems seem
> to guarantee it.
>
> Am I correct to think there's a kernel problem here?
>
> Thanks,
>
> Michael
>
> ===
>
> /* multi_writer.c
> */
>
> #include <sys/wait.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <errno.h>
>
> typedef enum { FALSE, TRUE } Boolean;
>
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                         } while (0)
>
> #define fatal(msg)      do { fprintf(stderr, "%s\n", msg); \
>                              exit(EXIT_FAILURE); } while (0)
>
> #define usageErr(msg, progName)        \
>                         do { fprintf(stderr, "Usage: "); \
>                              fprintf(stderr, msg, progName); \
>                              exit(EXIT_FAILURE); } while (0)
>
> int
> main(int argc, char *argv[])
> {
>     char *buf;
>     int j, k, nblocks, nchildren;
>     size_t blocksize;
>     struct stat sb;
> //  int nchanges;
> //  off_t pos;
>     long long expected;
>
>     if (argc < 4 || strcmp(argv[1], "--help") == 0)
>         usageErr("%s num-children num-blocks block-size [O_APPEND-flag]\n",
>                 argv[0]);
>
>     nblocks = atoi(argv[2]);
>     blocksize = atoi(argv[3]);
>
>     buf = malloc(blocksize + 1);
>     if (buf == NULL)
>         errExit("malloc");
>
>     /* If a fourth command-line argument is specified, set the O_APPEND
>        flag on stdout */
>
>     if (argc > 4)
>         if (fcntl(STDOUT_FILENO, F_SETFL, O_APPEND) == -1)
>             errExit("fcntl-F_SETFL");
>
>     nchildren = atoi(argv[1]);
>
>     /* Create child processes that write blocks to stdout */
>
>     for (j = 0; j < nchildren; j++) {
>         switch(fork()) {
>         case -1:
>             errExit("fork");
>
>         case 0: /* Each child writes nblocks * blocksize bytes to stdout */
> //          nchanges = 0;
>
>             /* Put something distinctive in each child's buffer (in case
>                we want to analyze byte sequences in the output) */
>
>             for (k = 0; k < blocksize; k++)
>                 buf[k] = 'a' + getpid() % 26;
>
>             for (k = 0; k < nblocks; k++) {
> //              if (k > 0 && pos != lseek(STDOUT_FILENO, 0, SEEK_END))
> //                  nchanges++;
>                 if (write(STDOUT_FILENO, buf, blocksize) != blocksize)
>                     fatal("write");
> //              pos = lseek(STDOUT_FILENO, 0, SEEK_END);
>             }
>
> //          fprintf(stderr, "%ld: nchanges = %d\n",
> //                  (long) getpid(), nchanges);
>             exit(EXIT_SUCCESS);
>
>         default:
>             break;      /* Parent falls through to create next child */
>         }
>     }
>
>     /* Wait for all children to terminate */
>
>     while (wait(NULL) > 0)
>         continue;
>
>     /* Compare final length of file against expected size */
>
>     if (fstat(STDOUT_FILENO, &sb) == -1)
>         errExit("fstat");
>
>     expected =  blocksize * nblocks * nchildren;
>     fprintf(stderr, "Expected file size: %10lld\n", expected);
>     fprintf(stderr, "Actual file size:   %10lld\n", (long long) sb.st_size);
>     fprintf(stderr, "Difference:         %10lld\n", expected - sb.st_size);
>
>     exit(EXIT_SUCCESS);
> }
>

Offlist, I was pointed to some previous threads on this topic:

http://thread.gmane.org/gmane.linux.kernel.kernelnewbies/43508

http://lwn.net/Articles/180387/

http://lwn.net/Articles/180396/

Notwithstanding the comments  of Alan and Linus at
http://thread.gmane.org/gmane.linux.kernel/397980/focus=398248 and
http://thread.gmane.org/gmane.linux.kernel/397980/focus=398281 this
*is* a violation of POSIX. When XSI 2.9.7 talks uses the "threads"
language, that is to provide the widest possible guarantee--i.e., that
the threads within a process also get the same atomicity guarantees as
threads in different processes.

I can't comment on the performance implications of adding locking to
fix this issue (at least for simultaneous I/O), but there is an
argument that it should be done on correctness grounds. Linux isn't
conformant with SUSv3 and SUSv4, and isn't consistent with other
implementations such as FreeBSD and Solaris. And I'm pretty sure Linux
isn't consistent with UNIX since early times. (E.g., page 191 of the
1992 edition of Stevens APUE discusses the sharing of the file offset
between the parent and child after fork(). Although Stevens didn't
explicitly spell out the atomicity guarantee, the discussion there
would be a bit nonsensical without the presumption of that guarantee.)

Thanks,

Michael

PS I tried hacking Linus's untested 2006 patch
(http://lwn.net/Articles/180396/) into the kernel, but perhaps not
surprisingly given the age of that patch, I ran into various errors
after boot that meant that user-space didn't come up.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Update of file offset on write() etc. is non-atomic with I/O
@ 2014-02-17 15:41 Michael Kerrisk (man-pages)
  2014-02-18 13:00 ` Michael Kerrisk
  2014-02-20 17:14 ` Linus Torvalds
  0 siblings, 2 replies; 43+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-17 15:41 UTC (permalink / raw)
  To: lkml
  Cc: mtk.manpages, Miklos Szeredi, Theodore T'so,
	Christoph Hellwig, Chris Mason, Dave Chinner, Linux-Fsdevel,
	Al Viro, J. Bruce Fields, Yongzhi Pan, Michael Kerrisk (gmail)

Hello all,

A note from Yongzhi Pan about some of my own code led me to dig deeper 
and discover behavior that is surprising and also seems to be a 
fairly clear violation of POSIX requirements.

It appears that write() (and, presumably read() and other similar 
system calls) are not atomic with respect to performing I/O and 
updating the file offset behavior. 

The problem can be demonstrated using the program below.
That program takes three arguments:

$ ./multi_writer num-children num-blocks block-size > somefile

It creates 'num-children' children, each of which writes 'num-blocks'
blocks of 'block-size' bytes to standard output; for my experiments,
stdout is redirected to a file. After all children have finished,
the parent inspects the size of the file written on stdout, calculates
the expected size of the file, and displays these two values, and 
their difference on stderr.

Some observations:

* All children inherit the stdout file descriptor from the parent;
  thus the FDs refer to the same open file description, and therefore
  share the file offset.

* When I run this on a multi-CPU BSD systems, I get the expected result:

$ ./multi_writer 10 10000 1000 > g 2> run.log
$ ls -l g
-rw-------  1 mkerrisk  users  100000000 Jan 17 07:34 g

* Someone else tested this code for me on a Solaris system, and also got 
  the expected result.

* On Linux, by contrast, we see behavior such as the following:
  
$ ./multi_writer 10 10000 1000 > g 
Expected file size:  100000000
Actual file size:     16323000
Difference:           83677000
$ ls -l g
-rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g

Summary of the above output: some children are overwriting the output
of other children because output is not atomic with respect to updates
to the file offset.

For reference, POSIX.1-2008/SUSv4 Section XSI 2.9.7 says:

[[
2.9.7 Thread Interactions with Regular File Operations

All of the following functions shall be atomic with respect to each other 
in the effects specified in POSIX.1-2008 when they operate on regular 
files or symbolic links:


chmod()
...
pread()
read()
...
readv()
pwrite()
...
write()
writev()
 

If two threads each call one of these functions, each call shall either 
see all of the specified effects of the other call, or none of them.
]]

(POSIX.1-2001 has similar text.)

This text is in one of the Threads sections, but it applies equally 
to threads in different processes as to threads in the same process.

I've tested the code below on ext4, XFS, and BtrFS, on kernel 3.12 and a 
number of other recent kernels, all with similar results, which suggests 
the result is in the VFS layer. (Can it really be so simple as no locking
around pieces such as 

                loff_t pos = file_pos_read(f.file);
                ret = vfs_write(f.file, buf, count, &pos);
                if (ret >= 0)
                        file_pos_write(f.file, pos);

in fs/read_write.c?)

I discovered this behavior after Yongzhi Pan reported some unexpected
behavior in some of my code that forked to create a parent and
child that wrote to the same file. In some cases, expected output
was not appearing. In other words, after a fork(), and in the absence 
of any other synchronization technique, a parent and a child cannot 
safely write to the same file descriptor without risking overwriting 
each other's output. But POSIX requires this, and other systems seem
to guarantee it.

Am I correct to think there's a kernel problem here?

Thanks,

Michael

===

/* multi_writer.c 
*/

#include <sys/wait.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/fcntl.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

typedef enum { FALSE, TRUE } Boolean;

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define fatal(msg)      do { fprintf(stderr, "%s\n", msg); \
                             exit(EXIT_FAILURE); } while (0)

#define usageErr(msg, progName)        \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

int
main(int argc, char *argv[])
{
    char *buf;
    int j, k, nblocks, nchildren;
    size_t blocksize;
    struct stat sb;
//  int nchanges;
//  off_t pos;
    long long expected;

    if (argc < 4 || strcmp(argv[1], "--help") == 0)
        usageErr("%s num-children num-blocks block-size [O_APPEND-flag]\n",
                argv[0]);

    nblocks = atoi(argv[2]);
    blocksize = atoi(argv[3]);

    buf = malloc(blocksize + 1);
    if (buf == NULL)
        errExit("malloc");

    /* If a fourth command-line argument is specified, set the O_APPEND
       flag on stdout */

    if (argc > 4)
        if (fcntl(STDOUT_FILENO, F_SETFL, O_APPEND) == -1)
            errExit("fcntl-F_SETFL");

    nchildren = atoi(argv[1]);

    /* Create child processes that write blocks to stdout */

    for (j = 0; j < nchildren; j++) {
        switch(fork()) {
        case -1:
            errExit("fork");

        case 0: /* Each child writes nblocks * blocksize bytes to stdout */
//          nchanges = 0;

	    /* Put something distinctive in each child's buffer (in case
	       we want to analyze byte sequences in the output) */

	    for (k = 0; k < blocksize; k++)
		buf[k] = 'a' + getpid() % 26;

            for (k = 0; k < nblocks; k++) {
//              if (k > 0 && pos != lseek(STDOUT_FILENO, 0, SEEK_END))
//                  nchanges++;
                if (write(STDOUT_FILENO, buf, blocksize) != blocksize)
                    fatal("write");
//              pos = lseek(STDOUT_FILENO, 0, SEEK_END);
            }

//          fprintf(stderr, "%ld: nchanges = %d\n",
//                  (long) getpid(), nchanges);
            exit(EXIT_SUCCESS);

        default:
            break;      /* Parent falls through to create next child */
        }
    }

    /* Wait for all children to terminate */

    while (wait(NULL) > 0)
        continue;

    /* Compare final length of file against expected size */

    if (fstat(STDOUT_FILENO, &sb) == -1)
        errExit("fstat");

    expected =  blocksize * nblocks * nchildren;
    fprintf(stderr, "Expected file size: %10lld\n", expected);
    fprintf(stderr, "Actual file size:   %10lld\n", (long long) sb.st_size);
    fprintf(stderr, "Difference:         %10lld\n", expected - sb.st_size);

    exit(EXIT_SUCCESS);
}


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2014-05-04  7:04 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <a8df285f-de7f-4a3a-9a19-e0ad07ab3a5c@blur>
2014-02-20 18:15 ` Update of file offset on write() etc. is non-atomic with I/O Zuckerman, Boris
2014-02-20 18:15   ` Zuckerman, Boris
2014-02-20 18:29   ` Al Viro
2014-02-21  6:01     ` Michael Kerrisk (man-pages)
2014-02-23  1:18       ` Kevin Easton
2014-02-23  7:38         ` Michael Kerrisk (man-pages)
2014-03-03 21:03 George Spelvin
2014-03-03 21:26 ` Al Viro
2014-03-03 21:52   ` Linus Torvalds
2014-03-03 22:01     ` Al Viro
2014-03-03 22:17       ` Linus Torvalds
2014-03-03 23:28         ` Al Viro
2014-03-03 23:34           ` Linus Torvalds
2014-03-03 23:42             ` Al Viro
2014-03-03 23:59               ` Linus Torvalds
2014-03-04  0:23                 ` Al Viro
2014-03-04  0:42                   ` Linus Torvalds
2014-03-04  1:05                     ` Al Viro
2014-03-04 20:00                       ` Al Viro
2014-03-04 21:17                         ` Linus Torvalds
2014-03-05  0:04                           ` Al Viro
2014-03-10 15:55                             ` Al Viro
2014-03-03 22:55     ` Linus Torvalds
2014-03-03 23:23       ` Linus Torvalds
2014-03-03 23:39         ` Al Viro
2014-03-03 23:54           ` Linus Torvalds
2014-03-03 23:54           ` Al Viro
2014-03-04 20:11           ` Cedric Blancher
2014-03-04  0:07     ` George Spelvin
2014-05-04  7:04 ` Michael Kerrisk
  -- strict thread matches above, loose matches on Subject: below --
2014-02-17 15:41 Michael Kerrisk (man-pages)
2014-02-18 13:00 ` Michael Kerrisk
2014-02-20 17:14 ` Linus Torvalds
2014-03-03 17:36   ` Linus Torvalds
2014-03-03 21:45     ` Al Viro
2014-03-03 21:56       ` Linus Torvalds
2014-03-03 22:09         ` Al Viro
2014-03-03 22:20           ` Linus Torvalds
2014-03-03 22:01       ` Linus Torvalds
2014-03-03 22:10         ` Al Viro
2014-03-03 22:22           ` Linus Torvalds
2014-03-06 15:03     ` Michael Kerrisk (man-pages)
2014-03-07  3:38       ` Yongzhi Pan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.