LKML Archive on lore.kernel.org
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: lkml <linux-kernel@vger.kernel.org>
Cc: mtk.manpages@gmail.com, Miklos Szeredi <miklos@szeredi.hu>,
	"Theodore T'so" <tytso@mit.edu>, Christoph Hellwig <hch@lst.de>,
	Chris Mason <clm@fb.com>, Dave Chinner <david@fromorbit.com>,
	Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	"J. Bruce Fields" <bfields@citi.umich.edu>,
	Yongzhi Pan <panyongzhi@gmail.com>,
	"Michael Kerrisk (gmail)" <mtk.manpages@gmail.com>
Subject: Update of file offset on write() etc. is non-atomic with I/O
Date: Mon, 17 Feb 2014 16:41:37 +0100
Message-ID: <53022DB1.4070805@gmail.com>

Hello all,

A note from Yongzhi Pan about some of my own code led me to dig deeper 
and discover behavior that is surprising and also seems to be a 
fairly clear violation of POSIX requirements.

It appears that write() (and, presumably, read() and other similar 
system calls) is not atomic with respect to performing I/O and 
updating the file offset. 

The problem can be demonstrated using the program below.
That program takes three arguments:

$ ./multi_writer num-children num-blocks block-size > somefile

It creates 'num-children' children, each of which writes 'num-blocks'
blocks of 'block-size' bytes to standard output; for my experiments,
stdout is redirected to a file. After all children have finished,
the parent inspects the size of the file written on stdout, calculates
the expected size of the file, and displays these two values and 
their difference on stderr.

Some observations:

* All children inherit the stdout file descriptor from the parent;
  thus the FDs refer to the same open file description, and therefore
  share the file offset.

* When I run this on a multi-CPU BSD system, I get the expected result:

$ ./multi_writer 10 10000 1000 > g 2> run.log
$ ls -l g
-rw-------  1 mkerrisk  users  100000000 Jan 17 07:34 g

* Someone else tested this code for me on a Solaris system, and also got 
  the expected result.

* On Linux, by contrast, we see behavior such as the following:
  
$ ./multi_writer 10 10000 1000 > g 
Expected file size:  100000000
Actual file size:     16323000
Difference:           83677000
$ ls -l g
-rw-r--r--. 1 mtk mtk 16323000 Feb 17 16:05 g

Summary of the above output: some children are overwriting the output
of other children because output is not atomic with respect to updates
to the file offset.
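
The shared offset from the first observation can be checked in isolation.
The following is a minimal sketch, not part of the test program; the
function name and the scratch path are made up for illustration. A child
writes through an inherited descriptor, and the parent, which never wrote
anything itself, then asks where the offset is.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's view of the file offset after a forked child has
   written 'len' bytes through the inherited descriptor, or -1 on error.
   'path' names a scratch file that is created and immediately unlinked. */
off_t
offset_after_child_write(const char *path, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    unlink(path);               /* File stays reachable through fd */

    char *buf = calloc(1, len ? len : 1);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        free(buf);
        close(fd);
        return -1;
    }
    if (pid == 0)               /* Child: write via the inherited FD */
        _exit(write(fd, buf, len) == (ssize_t) len ? 0 : 1);

    int status;
    off_t pos = -1;
    if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
        pos = lseek(fd, 0, SEEK_CUR);   /* Offset moved by the child */

    free(buf);
    close(fd);
    return pos;
}
```

With only one writer there is no race, so the returned offset should equal
'len'; the trouble starts only when several writers update that one shared
offset concurrently.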

For reference, POSIX.1-2008/SUSv4 Section XSH 2.9.7 says:

[[
2.9.7 Thread Interactions with Regular File Operations

All of the following functions shall be atomic with respect to each other 
in the effects specified in POSIX.1-2008 when they operate on regular 
files or symbolic links:


chmod()
...
pread()
read()
...
readv()
pwrite()
...
write()
writev()
 

If two threads each call one of these functions, each call shall either 
see all of the specified effects of the other call, or none of them.
]]

(POSIX.1-2001 has similar text.)

This text is in one of the Threads sections, but it applies equally 
to threads in different processes as to threads in the same process.

I've tested the code below on ext4, XFS, and Btrfs, on kernel 3.12 and a 
number of other recent kernels, all with similar results, which suggests 
the problem is in the VFS layer. (Can it really be so simple as no locking
around pieces such as 

                loff_t pos = file_pos_read(f.file);
                ret = vfs_write(f.file, buf, count, &pos);
                if (ret >= 0)
                        file_pos_write(f.file, pos);

in fs/read_write.c?)
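
To make the suspected interleaving concrete, here is a small userspace
model of those three steps. It is only a sketch: 'struct model_file' and
the model_*() helpers are invented for illustration and are not kernel
code. The replay function runs two writers by hand in the order that
loses an update: both load the offset before either stores it back.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for an open file description: one shared offset plus
   a small in-memory "file". */
struct model_file {
    long pos;
    long size;
    char data[64];
};

/* The three steps from the snippet above, split so that a caller can
   interleave them by hand. */
long
model_pos_read(const struct model_file *f)
{
    return f->pos;
}

long
model_write(struct model_file *f, char fill, long count, long *pos)
{
    assert(*pos >= 0 && count >= 0 &&
           *pos + count <= (long) sizeof(f->data));
    memset(f->data + *pos, fill, count);
    if (*pos + count > f->size)
        f->size = *pos + count;
    *pos += count;
    return count;
}

void
model_pos_write(struct model_file *f, long pos)
{
    f->pos = pos;
}

/* Replays the losing interleaving for writers A and B; returns the
   resulting file size. */
long
replay_lost_update(struct model_file *f, long count)
{
    long pos_a = model_pos_read(f);
    long pos_b = model_pos_read(f);         /* B sees the same offset */

    model_write(f, 'a', count, &pos_a);
    model_write(f, 'b', count, &pos_b);     /* Lands on top of A's data */

    model_pos_write(f, pos_a);
    model_pos_write(f, pos_b);              /* Offset advances only once */
    return f->size;
}
```

In this model, two writes of 'count' bytes leave a file of only 'count'
bytes, which is the same kind of shortfall as in the Linux run above.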

I discovered this behavior after Yongzhi Pan reported some unexpected
behavior in some of my code that forked to create a parent and
child that wrote to the same file. In some cases, expected output
was not appearing. In other words, after a fork(), and in the absence 
of any other synchronization technique, a parent and a child cannot 
safely write to the same file descriptor without risking overwriting 
each other's output. But POSIX requires that such writes be atomic, and
other systems seem to guarantee it.
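
For code that has to work on affected kernels, one possible form of
"other synchronization" is to take a record lock around each write.
The sketch below assumes that every writer cooperates by calling the same
helper; 'locked_write' is a made-up name. It uses fcntl() record locks,
which are owned per process and so do exclude a parent and child from
each other. flock() locks would not help here, because they belong to the
open file description that the processes share after fork().

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write 'len' bytes at the shared file offset while holding an exclusive
   fcntl() record lock on the whole file. Returns the result of write(),
   or -1 if locking or unlocking fails (including EINTR while waiting). */
ssize_t
locked_write(int fd, const void *buf, size_t len)
{
    struct flock fl = { 0 };
    ssize_t n;

    assert(buf != NULL || len == 0);

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */

    if (fcntl(fd, F_SETLKW, &fl) == -1)     /* Block until we own it */
        return -1;

    n = write(fd, buf, len);    /* I/O and offset update happen under lock */

    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &fl) == -1)
        return -1;

    return n;
}
```

The descriptor must be open for writing for F_WRLCK to succeed. The
O_APPEND option in the program below is another way to sidestep the
shared offset, since each write then goes to the current end of file.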

Am I correct to think there's a kernel problem here?

Thanks,

Michael

===

/* multi_writer.c 
*/

#include <sys/wait.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

typedef enum { FALSE, TRUE } Boolean;

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

#define fatal(msg)      do { fprintf(stderr, "%s\n", msg); \
                             exit(EXIT_FAILURE); } while (0)

#define usageErr(msg, progName)        \
                        do { fprintf(stderr, "Usage: "); \
                             fprintf(stderr, msg, progName); \
                             exit(EXIT_FAILURE); } while (0)

int
main(int argc, char *argv[])
{
    char *buf;
    int j, k, nblocks, nchildren;
    size_t blocksize;
    struct stat sb;
//  int nchanges;
//  off_t pos;
    long long expected;

    if (argc < 4 || strcmp(argv[1], "--help") == 0)
        usageErr("%s num-children num-blocks block-size [O_APPEND-flag]\n",
                argv[0]);

    nblocks = atoi(argv[2]);
    blocksize = atoi(argv[3]);

    buf = malloc(blocksize + 1);
    if (buf == NULL)
        errExit("malloc");

    /* If a fourth command-line argument is specified, set the O_APPEND
       flag on stdout */

    if (argc > 4)
        if (fcntl(STDOUT_FILENO, F_SETFL, O_APPEND) == -1)
            errExit("fcntl-F_SETFL");

    nchildren = atoi(argv[1]);

    /* Create child processes that write blocks to stdout */

    for (j = 0; j < nchildren; j++) {
        switch(fork()) {
        case -1:
            errExit("fork");

        case 0: /* Each child writes nblocks * blocksize bytes to stdout */
//          nchanges = 0;

            /* Put something distinctive in each child's buffer (in case
               we want to analyze byte sequences in the output) */

            for (k = 0; k < blocksize; k++)
                buf[k] = 'a' + getpid() % 26;

            for (k = 0; k < nblocks; k++) {
//              if (k > 0 && pos != lseek(STDOUT_FILENO, 0, SEEK_END))
//                  nchanges++;
                if (write(STDOUT_FILENO, buf, blocksize) != blocksize)
                    fatal("write");
//              pos = lseek(STDOUT_FILENO, 0, SEEK_END);
            }

//          fprintf(stderr, "%ld: nchanges = %d\n",
//                  (long) getpid(), nchanges);
            exit(EXIT_SUCCESS);

        default:
            break;      /* Parent falls through to create next child */
        }
    }

    /* Wait for all children to terminate */

    while (wait(NULL) > 0)
        continue;

    /* Compare final length of file against expected size */

    if (fstat(STDOUT_FILENO, &sb) == -1)
        errExit("fstat");

    expected = (long long) blocksize * nblocks * nchildren;
    fprintf(stderr, "Expected file size: %10lld\n", expected);
    fprintf(stderr, "Actual file size:   %10lld\n", (long long) sb.st_size);
    fprintf(stderr, "Difference:         %10lld\n", expected - sb.st_size);

    exit(EXIT_SUCCESS);
}
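
Since each child fills its buffer with a letter derived from its PID, the
output file can be examined to see how much of each child's data survived.
A minimal sketch of such a tally follows; 'tally_letters' is a made-up
helper, not part of the program above. Note that two children whose PIDs
are congruent modulo 26 will share a letter.

```c
#include <assert.h>
#include <unistd.h>

/* Count how many bytes readable from 'fd' (from its current offset to
   end of file) carry each of the letters 'a'..'z'. Returns the total
   number of bytes read, or -1 on a read error. */
long long
tally_letters(int fd, long long counts[26])
{
    char buf[4096];
    ssize_t n;
    long long total = 0;

    assert(counts != NULL);

    for (int i = 0; i < 26; i++)
        counts[i] = 0;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] >= 'a' && buf[i] <= 'z')
                counts[buf[i] - 'a']++;
        total += n;
    }

    return (n == -1) ? -1 : total;
}
```

If no output were lost, each letter's count would be a multiple of
num-blocks * block-size; smaller counts show which children had blocks
overwritten.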


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

