* Corruption with O_DIRECT and unaligned user buffers
@ 2008-11-14 17:04 Tim LaBerge
  2008-11-19  4:25   ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Tim LaBerge @ 2008-11-14 17:04 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 4243 bytes --]

The man page for open(2) states the following:

 O_DIRECT (Since Linux 2.6.10)
              Try to minimize cache effects of the I/O to and from this file.
              In general this will degrade performance, but it is useful in
              special situations, such as when applications do their own
              caching.  File I/O is done directly to/from user space buffers.
              The I/O is synchronous, that is, at the completion of a read(2)
              or write(2), data is guaranteed to have been transferred.  Under
              Linux 2.4 transfer sizes, and the alignment of user buffer and
              file offset must all be multiples of the logical block size of
              the file system.  Under Linux 2.6 alignment to 512-byte
              boundaries suffices.

However, it appears that data corruption may occur when a multithreaded
process reads into a non-page size aligned user buffer. A test program which
reliably reproduces the problem on ext3 and xfs is attached.

The program creates, patterns, reads, and verifies a series of files.

In the read phase, a file is opened with O_DIRECT n times, where n is the
number of CPUs. A single buffer large enough to contain the file is allocated
and patterned with data not found in any of the files. The alignment of the
buffer is controlled by a command line option.

Each file is read in parallel by n threads, where n is the number of CPUs.
Thread 0 reads the first page of data from the file into the first page
of the buffer, thread 1 reads the second page of data into the second page of
the buffer, and so on.  Thread n - 1 reads the remainder of the file into the
remainder of the buffer.

After a thread reads data into the buffer, it immediately verifies that the
contents of the buffer are correct. If the buffer contains corrupt data, the
thread dumps the data surrounding the corruption and calls abort(). Otherwise,
the thread exits.

Crucially, before the reader threads are dispatched, another thread is started
which calls fork()/usleep() in a loop until all reads are completed. The child
created by fork() does nothing but call exit(0).

A command line option controls whether the buffer is aligned.  In the case where
the buffer is aligned on a page boundary, all is well. In the case where the
buffer is aligned on a page + 512 byte offset, corruption is seen frequently.

I believe that what is happening is that in the direct IO path, because the
user's buffer is not aligned, some user pages are being mapped twice. When a
fork() happens in between the calls to map the page, the page will be marked as
COW. When the second map happens (via get_user_pages()), a new physical page
will be allocated and copied.

Thus, there is a race between the completion of the first read from disk (and
write to the user page) and get_user_pages() mapping the page for the second
time. If the write does not complete before the page is copied, the user will
see stale data in the first 512 bytes of this page of their buffer. Indeed,
this is the corruption most frequently seen. (It's also possible for the race
to be lost the other way, so that the last 3584 bytes of the page are stale.)
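
The 512/3584 split follows from simple page arithmetic. The sketch below is
only an illustration, not part of dma_thread.c: the helper worker_span() and
struct span are hypothetical names, and it assumes 4096-byte pages and a
page-aligned base buffer, as the test program does. It handles reads that
touch at most two pages.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size, matching dma_thread.c */

/* Hypothetical helper type: how one worker's read lands on buffer pages. */
struct span {
    size_t first_page;     /* index of the first buffer page touched */
    size_t bytes_in_first; /* bytes written into that page */
    size_t bytes_in_next;  /* bytes spilling into the following page */
};

/*
 * Worker j reads `length` bytes into base + align + j * PAGE_SIZE,
 * where base is page aligned. Only valid for length <= PAGE_SIZE.
 */
static struct span worker_span(size_t align, size_t worker, size_t length)
{
    size_t start = align + worker * PAGE_SIZE; /* offset from base */
    size_t room = PAGE_SIZE - (start % PAGE_SIZE);
    struct span s;

    s.first_page = start / PAGE_SIZE;
    s.bytes_in_first = length < room ? length : room;
    s.bytes_in_next = length - s.bytes_in_first;
    return s;
}
```

With align = 512, worker 0 writes 3584 bytes into page 0 and 512 bytes into
the start of page 1, while worker 1 writes its first 3584 bytes into the rest
of page 1. Page 1 is therefore pinned by two separate reads, which is the
window a concurrent fork() can hit. With align = 0, no page is shared.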

The attached program dma_thread.c (which is a heavily modified version of a
program provided by a customer seeing this problem) reliably reproduces the
problem on any multicore linux machine on both ext3 and xfs, although any
filesystem using the generic blockdev_direct_IO() routine is probably
vulnerable.

I've seen a few threads that mention the potential for this kind of problem,
but no definitive solution or workaround (other than "Don't do that").

Thanks,

Tim LaBerge


[-- Attachment #2: dma_thread.c --]
[-- Type: text/x-csrc, Size: 6582 bytes --]

/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>	/* strerror(), memset() */
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024) 
#define READSIZE  (1024*1024)

#define FILENAME    "test_%.04d.tmp"
#define FILECOUNT   100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE   4096

#define true	1
#define false	0

typedef int bool;

volatile bool	done	= false;	/* set by main, polled by fork_thread */
int	workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
    fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [-w <workers>]]\n"
		    "\nWith no arguments, generate test files and exit.\n"
		    "-h Display this help and exit.\n"
		    "-a align read buffer to offset <alignment>.\n"
		    "-w number of worker threads, 2 to 256,\n"
		    "   defaults to number of cores.\n\n"

		    "Run first with no arguments to generate files.\n"
		    "Then run with -a <alignment> = 512 or 0.\n");
}

typedef struct {
    pthread_t	    tid;
    int		    worker_number;
    int		    fd;
    int		    offset;
    int		    length;
    int		    pattern;
    unsigned char  *buffer;
} worker_t;


void *worker_thread(void * arg)
{
    int		    bytes_read;
    int		    i,k;
    worker_t	   *worker  = (worker_t *) arg;
    int		    offset  = worker->offset;
    int		    fd	    = worker->fd;
    unsigned char  *buffer  = worker->buffer;
    int		    pattern = worker->pattern;
    int		    length  = worker->length;
    
    if (lseek(fd, offset, SEEK_SET) < 0) {
	fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n", 
			offset, fd, strerror(errno));
	exit(1);
    }

    bytes_read = read(fd, buffer, length);
    if (bytes_read != length) {
	fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n", 
			fd, bytes_read, strerror(errno));
	exit(1);
    }

    /* Corruption check */
    for (i = 0; i < length; i++) {
	if (buffer[i] != pattern) {
	    /* Clamp the dump window so it stays inside this worker's range. */
	    int start = (i < 8) ? 0 : i - 8;

	    printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
	    printf("Data dump starting at 0x%.06x:\n", start);
	    printf("Expect 0x%x followed by 0x%x:\n",
		    pattern, PATTERN);

	    for (k = 0; k < 16 && start + k < length; k++) {
		printf("%02x ", buffer[start + k]);
		if (k == 7) {
		    printf("\n");
		}
	    }

	    printf("\n");
	    abort();
	}
    }

    return 0;
}

void *fork_thread (void *arg) 
{
    pid_t pid;

    while (!done) {
	pid = fork();
	if (pid == 0) {
	    exit(0);
	} else if (pid < 0) {
	    fprintf(stderr, "Failed to fork child.\n");
	    exit(1);
	} 
	waitpid(pid, NULL, 0 );
	usleep(100);
    }

    return NULL;

}

int main(int argc, char *argv[])
{
    unsigned char  *buffer = NULL;
    char	    filename[1024];
    int		    fd;
    bool	    dowrite = true;
    pthread_t	    fork_tid;
    int		    c, n, j;
    worker_t	   *worker;
    int		    align = 0;
    int		    offset, rc;

    workers = sysconf(_SC_NPROCESSORS_ONLN);

    while ((c = getopt(argc, argv, "a:hw:")) != -1) {
	switch (c) {
	case 'a':
	    align = atoi(optarg);
	    if (align < 0 || align > PAGE_SIZE) {
		printf("Bad alignment %d.\n", align);
		exit(1);
	    }
	    dowrite = false;
	    break;

	case 'h':
	    usage();
	    exit(0);
	    break;

	case 'w':
	    workers = atoi(optarg);
	    if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
		fprintf(stderr, "Worker count %d not between "
				"%d and %d, inclusive.\n",
				workers, MIN_WORKERS, MAX_WORKERS);
		usage();
		exit(1);
	    }
	    dowrite = false;
	    break;

	default:
	    usage();
	    exit(1);
	}
    }

    if (argc > 1 && (optind < argc)) {
	fprintf(stderr, "Bad command line.\n");
	usage();
	exit(1);
    }

    if (dowrite) {

	buffer = malloc(FILESIZE);
	if (buffer == NULL) {
	    fprintf(stderr, "Failed to malloc write buffer.\n");
	    exit(1);
	}

	for (n = 1; n <= FILECOUNT; n++) {
	    sprintf(filename, FILENAME, n);
	    fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
	    if (fd < 0) {
		printf("create failed(%s): %s.\n", filename, strerror(errno));
		exit(1);
	    }
	    memset(buffer, n, FILESIZE);
	    printf("Writing file %s.\n", filename);
	    if (write(fd, buffer, FILESIZE) != FILESIZE) {
		printf("write failed (%s)\n", filename);
	    }

	    close(fd);
	    fd = -1;
	}

	free(buffer);
	buffer = NULL;

	printf("done\n");
	exit(0);
    }

    printf("Using %d workers.\n", workers);

    worker = malloc(workers * sizeof(worker_t));
    if (worker == NULL) {
	fprintf(stderr, "Failed to malloc worker array.\n");
	exit(1);
    }

    for (j = 0; j < workers; j++) {
	worker[j].worker_number = j;
    }

    printf("Using alignment %d.\n", align);
    
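    /* posix_memalign() returns an error number instead of setting errno. */
    rc = posix_memalign((void **)&buffer, PAGE_SIZE, READSIZE + align);
    if (rc != 0) {
	fprintf(stderr, "Failed to allocate read buffer: %s.\n", strerror(rc));
	exit(1);
    }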
    printf("Read buffer: %p.\n", buffer);
    for (n = 1; n <= FILECOUNT; n++) {

	sprintf(filename, FILENAME, n);
	for (j = 0; j < workers; j++) {
	    if ((worker[j].fd = open(filename,  O_RDONLY|O_DIRECT)) < 0) {
		fprintf(stderr, "Failed to open %s: %s.\n",
				filename, strerror(errno));
		exit(1);
	    }

	    worker[j].pattern = n;
	}

	printf("Reading file %d.\n", n);

	for (offset = 0; offset < FILESIZE; offset += READSIZE) {
	    memset(buffer, PATTERN, READSIZE + align);
	    for (j = 0; j < workers; j++) {
		worker[j].offset = offset + j * PAGE_SIZE;
		worker[j].buffer = buffer + align + j * PAGE_SIZE;
		worker[j].length = PAGE_SIZE;
	    }
	    /* The final worker reads whatever is left over. */
	    worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

	    done = 0;

	    rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Can't create fork thread: %s.\n", 
				strerror(rc));
		exit(1);
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_create(&worker[j].tid, 
				    NULL, 
				    worker_thread, 
				    worker + j);
		if (rc != 0) {
		    fprintf(stderr, "Can't create worker thread %d: %s.\n", 
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_join(worker[j].tid, NULL);
		if (rc != 0) {
		    fprintf(stderr, "Failed to join worker thread %d: %s.\n",
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    /* Let the fork thread know it's ok to exit */
	    done = 1;

	    rc = pthread_join(fork_tid, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Failed to join fork thread: %s.\n",
				strerror(rc));
		exit(1);
	    }
	}

	/* Close the fd's for the next file. */
	for (j = 0; j < workers; j++) {
	    close(worker[j].fd);
	}
    }

    return 0;
}


* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-11-14 17:04 Corruption with O_DIRECT and unaligned user buffers Tim LaBerge
@ 2008-11-19  4:25   ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2008-11-19  4:25 UTC (permalink / raw)
  To: Tim LaBerge, Arcangeli, Andrea; +Cc: linux-mm, linux-fsdevel

On Saturday 15 November 2008 04:04, Tim LaBerge wrote:

> However, it appears that data corruption may occur when a multithreaded
> process reads into a non-page size aligned user buffer. A test program
> which reliably reproduces the problem on ext3 and xfs is attached.
>
> The program creates, patterns, reads, and verify a series of files.
>
> In the read phase, a file is opened with O_DIRECT n times, where n is the
> number of cpu's. A single buffer large enough to contain the file is
> allocated
> and patterned with data not found in any of the files. The alignment of the
> buffer is controlled by a command line option.
>
> Each file is read in parallel by n threads, where n is the number of cpu's.
> Thread 0 reads the first page of data from the file into the first page
> of the buffer, thread 1 reads the second page of data in to the second
> page of
> the buffer, and so on.  Thread n - 1 reads the remainder of the file
> into the
> remainder of the buffer.
>
> After a thread reads data into the buffer, it immediately verifies that the
> contents of the buffer are correct. If the buffer contains corrupt data,
> the thread dumps the data surrounding the corruption and calls abort().
> Otherwise,
> the thread exits.
>
> Crucially, before the reader threads are dispatched, another thread is
> started
> which calls fork()/msleep() in a loop until all reads are completed. The
> child
> created by fork() does nothing but call exit(0).
>
> A command line option controls whether the buffer is aligned.  In the
> case where
> the buffer is aligned on a page boundary, all is well. In the case where
> the buffer is aligned on a page + 512 byte offset, corruption is seen
> frequently.
>
> I believe that what is happening is that in the direct IO path, because the
> user's buffer is not aligned, some user pages are being mapped twice. When
> a fork() happens in between the calls to map the page, the page will be
> marked as
> COW. When the second map happens (via get_user_pages()), a new physical
> page will be allocated and copied.
>
> Thus, there is a race between the completion of the first read from disk
> (and
> write to the user page) and get_user_pages() mapping the page for the
> second time. If the write does not complete before the page is copied, the
> user will
> see stale data in the first 512 bytes of this page of their buffer. Indeed,
> this is corruption most frequently seen. (It's also possible for the
> race to be
> lost the other way, so that the last 3584 bytes of the page are stale.)
>
> The attached program dma_thread.c (which is a heavily modified version of a
> program provided by a customer seeing this problem) reliably reproduces the
> problem on any multicore linux machine on both ext3 and xfs, although any
> filesystem using the generic blockdev_direct_IO() routine is probably
> vulnerable.
>
> I've seen a few threads that mention the potential for this kind of
> problem, but no
> definitive solution or workaround (other than "Don't do that").

I think your analysis is correct. It is in the same class of problems
that Andrea identified with fork and COW vs get_user_pages().

(I'm sorry Andrea for being really slow in participating in that thread,
I've just been spending some time tinkering and thinking, but I'll
reply soon...)

The solution either involves synchronising forks and get_user_pages,
or probably better, to do copy on fork rather than COW in the case
that we detect a page is subject to get_user_pages. The trick is in
the details :)

Thanks for the test program though, that's something I hadn't actually
written myself yet so that's really useful.

For the moment (and previous kernels up to now), I guess you have to
be careful about fork and get_user_pages, unfortunately.


* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-11-19  4:25   ` Nick Piggin
@ 2008-11-19  6:52     ` Nick Piggin
  -1 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2008-11-19  6:52 UTC (permalink / raw)
  To: Tim LaBerge; +Cc: Arcangeli, Andrea, linux-mm, linux-fsdevel

On Wednesday 19 November 2008 15:25, Nick Piggin wrote:

> For the moment (and previous kernels up to now), I guess you have to
> be careful about fork and get_user_pages, unfortunately.

I'm reminded by someone wishing to remain anonymous that one of
the ways that we can "be careful", is to use MADV_DONTFORK for
ranges that may be under direct IO.

Not a beautiful solution, but it might work.

If you need some sharing of that region between parent and child,
you could alternatively use a shared mapping (eg. MAP_ANONYMOUS |
MAP_SHARED) and avoid the COW issue completely.
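
A minimal sketch of both workarounds, assuming the direct I/O buffer can be
allocated with mmap(2) rather than malloc(). The function names
alloc_dio_buffer_dontfork() and alloc_dio_buffer_shared() are hypothetical;
mmap(), madvise() and MADV_DONTFORK are the standard Linux interfaces
(MADV_DONTFORK needs Linux 2.6.16 or later).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Private anonymous buffer that is not inherited across fork().
 * The child gets no mapping for this range, so fork() never
 * write-protects the parent's pages while direct I/O is in flight.
 */
static void *alloc_dio_buffer_dontfork(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (madvise(p, len, MADV_DONTFORK) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/*
 * Shared anonymous buffer: parent and child see the same pages,
 * so there is no copy-on-write for this range at all.
 */
static void *alloc_dio_buffer_shared(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Both helpers return page-aligned memory, so a 512-byte offset into the buffer
is still possible; the point is only that fork() no longer changes which
physical page backs the range. Release either buffer with munmap(p, len).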

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-11-19  4:25   ` Nick Piggin
@ 2008-11-19 16:58     ` Andrea Arcangeli
  -1 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-19 16:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Tim LaBerge, linux-mm, linux-fsdevel

On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
> The solution either involves synchronising forks and get_user_pages,
> or probably better, to do copy on fork rather than COW in the case
> that we detect a page is subject to get_user_pages. The trick is in
> the details :)

We already have a patch that works.

The only trouble here is get_user_pages_fast, it breaks the fix for
fork, the current ksm (that is safe against get_user_pages but can't
be safe against get_user_pages_fast) and even migrate.c
memory-corrupts against O_DIRECT after the introduction of
get_user_pages_fast.

So I recommend focusing on how to fix get_user_pages_fast for any of
the 3 broken pieces, then hopefully the same fix will work for the
other two.

fork is special in that it even breaks against get_user_pages but
again we've a fix for that. The only problem without a solution is how
to serialize against get_user_pages_fast. A brlock was my proposal,
not nice but still better than backing out get_user_pages_fast.

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-11-19 16:58     ` Andrea Arcangeli
  (?)
@ 2008-12-18 15:29     ` Andrea Arcangeli
  2008-12-19  2:21       ` KAMEZAWA Hiroyuki
                         ` (3 more replies)
  -1 siblings, 4 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-12-18 15:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Tim LaBerge, linux-mm, linux-fsdevel

On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
> > The solution either involves synchronising forks and get_user_pages,
> > or probably better, to do copy on fork rather than COW in the case
> > that we detect a page is subject to get_user_pages. The trick is in
> > the details :)
> 
> We already have a patch that works.

Here it is below, had to produce it for rhel (so far it was only in
our minds and it didn't float around just yet).

So this fixes the reported bug for me, Tim can you check to be sure?
Very convenient that I didn't need to write the reproducer myself,
this was a very nice testcase thanks a lot, probably worth adding to
ltp ;).

The problem is that this only fixes it for rhel and other kernels that
don't have get_user_pages_fast yet. You really have to think of some way
to serialize get_user_pages_fast for this and ksm. get_user_pages_fast
makes it an unfixable bug to mark any anon pte from readwrite to
readonly when there could be O_DIRECT on it, this has to be solved
sooner or later...

So last detail, I take it as safe not to check if the pte is writeable
after handle_mm_fault returns as the new address space is private and
the page fault couldn't possibly race with anything (i.e. pte_same is
guaranteed to succeed). For the mainline version we can remove the
page lock and replace it with smp_wmb in add_to_swap_cache and smp_rmb in
the page_count/PG_swapcache read to remove that trylockpage. Given
smp_wmb is barrier() it should be worth it.

If you see something wrong during review below let me know, this is a
tricky place to change. Note the ->open done after copy_page_range
returns in fork: do_wp_page will run and copy anon pages before ->open
is run on the child vma. Given those are anon pages I think it should
work, but that said I doubt I have exercised any device driver open
method there in practice yet. Thanks!

------
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: fork-o_direct-race

Think of a thread writing constantly to the last 512bytes of a page, while
another thread reads and writes to/from the first 512bytes of the page. We can
lose O_DIRECT reads the very moment we mark any pte wrprotected because a third
unrelated thread forks off a child.

This fixes it by never wrprotecting anon ptes if there can be any direct I/O in
flight to the page, and by instantiating a readonly pte and triggering a COW in
the child. The only trouble here is O_DIRECT reads (writes to memory, read
from disk). Checking the page_count under the PT lock guarantees no
get_user_pages could be running under us because if somebody wants to write to
the page, it has to break any cow first and that requires taking the PT lock in
follow_page before increasing the page count.

The COW triggered inside fork will run while the parent pte is read-write, this
is not usual but that's ok as it's only a page copy and it doesn't modify the
page contents.

In the long term there should be a smp_wmb() in between page_cache_get and
SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
PageSwapCache and the page_count() to remove the trylock op.

Fixed version of original patch from Nick Piggin.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
--- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
+++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
@@ -368,7 +368,7 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff -ur rhel-5.2/mm/memory.c x/mm/memory.c
--- rhel-5.2/mm/memory.c	2008-07-10 17:26:44.000000000 +0200
+++ x/mm/memory.c	2008-12-18 15:51:17.000000000 +0100
@@ -426,7 +426,7 @@
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -434,6 +434,7 @@
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int forcecow = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -464,15 +465,6 @@
 	}
 
 	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = *src_pte;
-	}
-
-	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
 	 */
@@ -484,11 +476,34 @@
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
+		if (is_cow_mapping(vm_flags) && PageAnon(page)) {
+			if (unlikely(TestSetPageLocked(page)))
+				forcecow = 1;
+			else {
+				if (unlikely(page_count(page) !=
+					     page_mapcount(page)
+					     + !!PageSwapCache(page)))
+					forcecow = 1;
+				unlock_page(page);
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}
 
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child
+	 */
+	if (is_cow_mapping(vm_flags)) {
+		if (!forcecow)
+			ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
+	}
+
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return forcecow;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -499,8 +514,10 @@
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;
 
 again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -510,6 +527,9 @@
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -525,7 +545,7 @@
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -534,6 +554,10 @@
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (forcecow)
+		if (__handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) &
+		    (VM_FAULT_OOM | VM_FAULT_SIGBUS))
+			return -ENOMEM;
 	if (addr != end)
 		goto again;
 	return 0;
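
The reference-count test that the patch adds to copy_one_pte() can be modeled
in user space. The sketch below is not kernel code: struct page_model and
needs_forcecow() are hypothetical stand-ins for struct page and the inline
check, and the field values are taken as they appear at the point of the
check (after get_page() and page_dup_rmap() have run for the child).

```c
#include <stdbool.h>

/* Toy stand-in for the struct page state the patch consults. */
struct page_model {
    int  count;        /* all references: mappings, swap cache, GUP pins */
    int  mapcount;     /* number of ptes mapping the page */
    bool in_swapcache; /* the swap cache holds one extra reference */
    bool locked;       /* somebody else holds the page lock */
};

/*
 * Returns true when fork should leave the parent pte writable and
 * copy the page for the child immediately, instead of sharing it
 * copy-on-write.
 */
static bool needs_forcecow(const struct page_model *pg)
{
    /* Trylock failed: the counts cannot be trusted, be conservative. */
    if (pg->locked)
        return true;

    /* Any reference beyond mappings and swap cache may be a GUP pin. */
    return pg->count != pg->mapcount + (pg->in_swapcache ? 1 : 0);
}
```

A page mapped twice with count 2 is shared as usual; the same page with
count 3 and no swap cache entry has an unexplained reference, which the patch
treats as possible direct I/O in flight.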

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-18 15:29     ` Andrea Arcangeli
@ 2008-12-19  2:21       ` KAMEZAWA Hiroyuki
  2008-12-19  5:06           ` KAMEZAWA Hiroyuki
  2008-12-19  6:34       ` KOSAKI Motohiro
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-19  2:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Thu, 18 Dec 2008 16:29:52 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> @@ -484,11 +476,34 @@
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page);
> +		if (is_cow_mapping(vm_flags) && PageAnon(page)) {
> +			if (unlikely(TestSetPageLocked(page)))
> +				forcecow = 1;
> +			else {
> +				if (unlikely(page_count(page) !=
> +					     page_mapcount(page)
> +					     + !!PageSwapCache(page)))
> +					forcecow = 1;
> +				unlock_page(page);
> +			}
> +		}
>  		rss[!!PageAnon(page)]++;
>  	}
 - Why do you check only Anon rather than all MAP_PRIVATE mappings ?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  2:21       ` KAMEZAWA Hiroyuki
@ 2008-12-19  5:06           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-19  5:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Fri, 19 Dec 2008 11:21:25 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 18 Dec 2008 16:29:52 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > @@ -484,11 +476,34 @@
> >  	if (page) {
> >  		get_page(page);
> >  		page_dup_rmap(page);
> > +		if (is_cow_mapping(vm_flags) && PageAnon(page)) {
> > +			if (unlikely(TestSetPageLocked(page)))
> > +				forcecow = 1;
> > +			else {
> > +				if (unlikely(page_count(page) !=
> > +					     page_mapcount(page)
> > +					     + !!PageSwapCache(page)))
> > +					forcecow = 1;
> > +				unlock_page(page);
> > +			}
> > +		}
> >  		rss[!!PageAnon(page)]++;
> >  	}
>  - Why do you check only Anon rather than all MAP_PRIVATE mappings ?
> 
Sorry, ignore this question.

-Kame


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-18 15:29     ` Andrea Arcangeli
  2008-12-19  2:21       ` KAMEZAWA Hiroyuki
@ 2008-12-19  6:34       ` KOSAKI Motohiro
  2008-12-20 16:02           ` Andrea Arcangeli
  2008-12-19  7:19       ` KAMEZAWA Hiroyuki
  2008-12-19 11:51         ` Li Zefan
  3 siblings, 1 reply; 32+ messages in thread
From: KOSAKI Motohiro @ 2008-12-19  6:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

Hi

I don't understand your patch yet, so these may be dumb questions.


> Problem this only fixes it for rhel and other kernels that don't have
> get_user_pages_fast yet. You really have to think at some way to
> serialize get_user_pages_fast for this and ksm. get_user_pages_fast
> makes it a unfixable bug to mark any anon pte from readwrite to
> readonly when there could be O_DIRECT on it, this has to be solved
> sooner or later...

I'm confused.

I think gup_pte_range() doesn't change any pte attributes.
Could you explain why get_user_pages_fast() is evil?


> So last detail, I take it as safe not to check if the pte is writeable
> after handle_mm_fault returns as the new address space is private and
> the page fault couldn't possibly race with anything (i.e. pte_same is
> guaranteed to succeed). For the mainline version we can remove the
> page lock and replace with smb_wmb in add_to_swap_cache and smp_rmb in
> the page_count/PG_swapcache read to remove that trylockpage. Given
> smp_wmb is barrier() it should worth it.

Why can't RHEL use a memory barrier?





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-18 15:29     ` Andrea Arcangeli
  2008-12-19  2:21       ` KAMEZAWA Hiroyuki
  2008-12-19  6:34       ` KOSAKI Motohiro
@ 2008-12-19  7:19       ` KAMEZAWA Hiroyuki
  2008-12-19  7:44         ` Li Zefan
  2008-12-20 15:55           ` Andrea Arcangeli
  2008-12-19 11:51         ` Li Zefan
  3 siblings, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-19  7:19 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Thu, 18 Dec 2008 16:29:52 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
> > On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
> > > The solution either involves synchronising forks and get_user_pages,
> > > or probably better, to do copy on fork rather than COW in the case
> > > that we detect a page is subject to get_user_pages. The trick is in
> > > the details :)
> > 

> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: fork-o_direct-race
> 
> Think a thread writing constantly to the last 512bytes of a page, while another
> thread read and writes to/from the first 512bytes of the page. We can lose
> O_DIRECT reads, the very moment we mark any pte wrprotected because a third
> unrelated thread forks off a child.
> 
> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
> flight to the page, and by instantiating a readonly pte and triggering a COW in
> the child. The only trouble here are O_DIRECT reads (writes to memory, read
> from disk). Checking the page_count under the PT lock guarantees no
> get_user_pages could be running under us because if somebody wants to write to
> the page, it has to break any cow first and that requires taking the PT lock in
> follow_page before increasing the page count.
> 
> The COW triggered inside fork will run while the parent pte is read-write, this
> is not usual but that's ok as it's only a page copy and it doesn't modify the
> page contents.
> 
> In the long term there should be a smp_wmb() in between page_cache_get and
> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
> PageSwapCache and the page_count() to remove the trylock op.
> 
> Fixed version of original patch from Nick Piggin.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Confirmed this fixes the problem.

Hmm, but, fork() gets slower. 

Result of cost-of-fork() on ia64.
==
  size of memory   before    after
  Anon=1M          0.07ms    0.08ms
  Anon=10M         0.17ms    0.22ms
  Anon=100M        1.15ms    1.64ms
  Anon=1000M       11.5ms    15.821ms
==

fork() cost is about 138% of the original (11.5ms -> 15.821ms) when the
process has 1G of Anon.
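
The relative cost for each row can be recomputed from the numbers as posted
in the table above. A minimal sketch follows; struct fork_sample, samples[]
and relative_cost_percent() are hypothetical names used only for this
illustration, and the values are copied from the table.

```c
/* One row of the table: fork() cost before and after the patch. */
struct fork_sample {
    const char *label;
    double before_ms;
    double after_ms;
};

/* Values copied from the table above. */
static const struct fork_sample samples[] = {
    { "Anon=1M",    0.07,  0.08   },
    { "Anon=10M",   0.17,  0.22   },
    { "Anon=100M",  1.15,  1.64   },
    { "Anon=1000M", 11.5,  15.821 },
};

/* Cost after the patch as a percentage of the cost before it. */
double relative_cost_percent(const struct fork_sample *s)
{
    return s->after_ms / s->before_ms * 100.0;
}
```

For the four rows this gives roughly 114%, 129%, 143% and 138%.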

The test program is below (measured with /usr/bin/time).
==
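#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>


int main(int argc, char *argv[])
{
        size_t size;
        int i, status;
        char *c;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <size in MB>\n", argv[0]);
                return 1;
        }
        /* size of the anonymous region, in bytes */
        size = (size_t)atoi(argv[1]) * 1024 * 1024;
        c = malloc(size);
        if (!c)
                return 1;
        /* touch every page so that fork() has ptes to copy */
        memset(c, 0, size);
        for (i = 0; i < 5000; i++) {
                if (!fork()) {
                        exit(0);
                }
                wait(&status);
        }
        return 0;
}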
==
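
The program above relies on /usr/bin/time and so includes process startup
and the initial memset in the measurement. As a rough alternative, the
fork()/wait() loop alone can be timed in-process. This is a sketch under
stated assumptions, not the method used for the numbers above:
fork_cost_ms() is a hypothetical helper, and it uses clock_gettime() with
CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Average wall-clock cost, in milliseconds, of one fork()+waitpid() cycle
 * while the parent holds `bytes` of touched anonymous memory.
 * Returns -1.0 on error.
 */
double fork_cost_ms(size_t bytes, int iterations)
{
    struct timespec t0, t1;
    long page = sysconf(_SC_PAGESIZE);
    volatile char *mem;
    size_t off;
    int i, status;

    if (iterations <= 0 || page <= 0)
        return -1.0;
    mem = malloc(bytes);
    if (!mem)
        return -1.0;
    /* Write one byte per page so every page is really mapped. */
    for (off = 0; off < bytes; off += (size_t)page)
        mem[off] = 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            free((void *)mem);
            return -1.0;
        }
        if (pid == 0)
            _exit(0);           /* child exits immediately */
        waitpid(pid, &status, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free((void *)mem);

    return ((t1.tv_sec - t0.tv_sec) * 1e3 +
            (t1.tv_nsec - t0.tv_nsec) / 1e6) / iterations;
}
```

Calling fork_cost_ms(1000UL * 1024 * 1024, 5000) would correspond to the
Anon=1000M row, though the absolute numbers are not expected to match the
/usr/bin/time results since child exit and reaping are included per cycle.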






^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:19       ` KAMEZAWA Hiroyuki
@ 2008-12-19  7:44         ` Li Zefan
  2008-12-19  8:45             ` Li Zefan
  2008-12-19 20:27             ` Andrea Arcangeli
  2008-12-20 15:55           ` Andrea Arcangeli
  1 sibling, 2 replies; 32+ messages in thread
From: Li Zefan @ 2008-12-19  7:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Andrea Arcangeli
  Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, FNST-Wang Chen

[-- Attachment #1: Type: text/plain, Size: 2265 bytes --]

KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Dec 2008 16:29:52 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
>> On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
>>> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
>>>> The solution either involves synchronising forks and get_user_pages,
>>>> or probably better, to do copy on fork rather than COW in the case
>>>> that we detect a page is subject to get_user_pages. The trick is in
>>>> the details :)
> 
>> From: Andrea Arcangeli <aarcange@redhat.com>
>> Subject: fork-o_direct-race
>>
>> Think a thread writing constantly to the last 512bytes of a page, while another
>> thread read and writes to/from the first 512bytes of the page. We can lose
>> O_DIRECT reads, the very moment we mark any pte wrprotected because a third
>> unrelated thread forks off a child.
>>
>> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
>> flight to the page, and by instantiating a readonly pte and triggering a COW in
>> the child. The only trouble here are O_DIRECT reads (writes to memory, read
>> from disk). Checking the page_count under the PT lock guarantees no
>> get_user_pages could be running under us because if somebody wants to write to
>> the page, it has to break any cow first and that requires taking the PT lock in
>> follow_page before increasing the page count.
>>
>> The COW triggered inside fork will run while the parent pte is read-write, this
>> is not usual but that's ok as it's only a page copy and it doesn't modify the
>> page contents.
>>
>> In the long term there should be a smp_wmb() in between page_cache_get and
>> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
>> PageSwapCache and the page_count() to remove the trylock op.
>>
>> Fixed version of original patch from Nick Piggin.
>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> Confirmed this fixes the problem.
> 

We tested RHEL 5.2 plus this patch on i386 using the test program provided
by Tim LaBerge. The program usually passes, but it sometimes hangs. An strace
log is attached, and we will test again with LOCKDEP enabled to see if we can
get more information.

BTW, the patch works fine on IA64.

> Hmm, but, fork() gets slower. 

[-- Attachment #2: strace.log --]
[-- Type: text/x-log, Size: 25241 bytes --]

xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5193
futex(0xb6a18bd8, FUTEX_WAIT, 5192, NULL) = 0
futex(0xb7419bd8, FUTEX_WAIT, 5193, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5191, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5200
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5201
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5202
futex(0xb7419bd8, FUTEX_WAIT, 5201, NULL) = 0
futex(0xb6a18bd8, FUTEX_WAIT, 5202, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5200, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5207
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5208
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5209
futex(0xb6a18bd8, FUTEX_WAIT, 5208, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5207, NULL) = -1 EAGAIN (Resource temporarily unavailable)
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5221
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5222
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5223
futex(0xb7419bd8, FUTEX_WAIT, 5222, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5221, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5228
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5229
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5230
futex(0xb6a18bd8, FUTEX_WAIT, 5229, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5228, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5234
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5235
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5236
futex(0xb7419bd8, FUTEX_WAIT, 5235, NULL) = 0
futex(0xb6a18bd8, FUTEX_WAIT, 5236, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5234, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5241
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5242
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5243
futex(0xb6a18bd8, FUTEX_WAIT, 5242, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5241, NULL) = 0
close(3)                                = 0
close(4)                                = 0
open("test_0060.tmp", O_RDONLY|O_DIRECT) = 3
open("test_0060.tmp", O_RDONLY|O_DIRECT) = 4
write(1, "Reading file 60.\n", 17Reading file 60.
)      = 17
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5248
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5249
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5250
futex(0xb7419bd8, FUTEX_WAIT, 5249, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5248, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5257
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5258
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5259
futex(0xb6a18bd8, FUTEX_WAIT, 5258, NULL) = 0
futex(0xb7419bd8, FUTEX_WAIT, 5259, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5257, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5266
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5267
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5268
futex(0xb7419bd8, FUTEX_WAIT, 5267, NULL) = 0
futex(0xb6a18bd8, FUTEX_WAIT, 5268, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5266, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5279
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5280
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5281
futex(0xb6a18bd8, FUTEX_WAIT, 5280, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5279, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5288
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5289
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5290
futex(0xb7419bd8, FUTEX_WAIT, 5289, NULL) = 0
futex(0xb6a18bd8, FUTEX_WAIT, 5290, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5288, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5297
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5298
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5299
futex(0xb6a18bd8, FUTEX_WAIT, 5298, NULL) = 0
futex(0xb7419bd8, FUTEX_WAIT, 5299, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5297, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5306
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5307
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5308
futex(0xb7419bd8, FUTEX_WAIT, 5307, NULL) = 0
futex(0xb6a18bd8, FUTEX_WAIT, 5308, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5306, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5313
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5314
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5315
futex(0xb6a18bd8, FUTEX_WAIT, 5314, NULL) = 0
futex(0xb7419bd8, FUTEX_WAIT, 5315, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5313, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5320
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5321
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5322
futex(0xb7419bd8, FUTEX_WAIT, 5321, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5320, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5328
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5329
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5331
futex(0xb6a18bd8, FUTEX_WAIT, 5329, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7419bd8, FUTEX_WAIT, 5331, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5328, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5337
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5338
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5339
futex(0xb7419bd8, FUTEX_WAIT, 5338, NULL) = 0
futex(0xb6a18bd8, FUTEX_WAIT, 5339, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5337, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5356
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5357
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5358
futex(0xb6a18bd8, FUTEX_WAIT, 5357, NULL) = 0
futex(0xb7419bd8, FUTEX_WAIT, 5358, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5356, NULL) = 0
close(3)                                = 0
close(4)                                = 0
open("test_0061.tmp", O_RDONLY|O_DIRECT) = 3
open("test_0061.tmp", O_RDONLY|O_DIRECT) = 4
write(1, "Reading file 61.\n", 17Reading file 61.
)      = 17
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5366
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5367
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5369
futex(0xb7419bd8, FUTEX_WAIT, 5367, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5366, NULL) = 0
clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5372
clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5373
clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5375
futex(0xb6a18bd8, FUTEX_WAIT, 5373, NULL) = 0
futex(0xb7419bd8, FUTEX_WAIT, 5375, NULL) = 0
futex(0xb7e1abd8, FUTEX_WAIT, 5372, NULL


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:44         ` Li Zefan
@ 2008-12-19  8:45             ` Li Zefan
  2008-12-19 20:27             ` Andrea Arcangeli
  1 sibling, 0 replies; 32+ messages in thread
From: Li Zefan @ 2008-12-19  8:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Tim LaBerge, linux-mm,
	linux-fsdevel, FNST-Wang Chen

Li Zefan wrote:
> KAMEZAWA Hiroyuki wrote:
>> On Thu, 18 Dec 2008 16:29:52 +0100
>> Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>>> On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
>>>> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
>>>>> The solution either involves synchronising forks and get_user_pages,
>>>>> or probably better, to do copy on fork rather than COW in the case
>>>>> that we detect a page is subject to get_user_pages. The trick is in
>>>>> the details :)
>>> From: Andrea Arcangeli <aarcange@redhat.com>
>>> Subject: fork-o_direct-race
>>>
>>> Think a thread writing constantly to the last 512bytes of a page, while another
>>> thread read and writes to/from the first 512bytes of the page. We can lose
>>> O_DIRECT reads, the very moment we mark any pte wrprotected because a third
>>> unrelated thread forks off a child.
>>>
>>> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
>>> flight to the page, and by instantiating a readonly pte and triggering a COW in
>>> the child. The only trouble here are O_DIRECT reads (writes to memory, read
>>> from disk). Checking the page_count under the PT lock guarantees no
>>> get_user_pages could be running under us because if somebody wants to write to
>>> the page, it has to break any cow first and that requires taking the PT lock in
>>> follow_page before increasing the page count.
>>>
>>> The COW triggered inside fork will run while the parent pte is read-write, this
>>> is not usual but that's ok as it's only a page copy and it doesn't modify the
>>> page contents.
>>>
>>> In the long term there should be a smp_wmb() in between page_cache_get and
>>> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
>>> PageSwapCache and the page_count() to remove the trylock op.
>>>
>>> Fixed version of original patch from Nick Piggin.
>>>
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> Confirmed this fixes the problem.
>>
> 
> We tested with RHEL 5.2 + patch on i386 using the test program provided by
> Tim LaBerge, though the program can pass but sometimes hanged. strace log is
> attached, and we'll test it again with LOCKDEP enabled to see if we can get
> some other information.
> 

# ./dma_thread -a 512

Using 2 workers.

Using alignment 512.

Read buffer: 0xb7e4e000.

Reading file 1.

Reading file 2.

...

Reading file 26.

Reading file 27.


(hang here, Ctrl+C can break the process)


And when we modified the program to use 'dma_thread -a 512 -w 1', we could
still see the hang, though at a very low frequency.
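
For readers following along: per the original report, the -a option controls the alignment of the read buffer. Below is a minimal sketch of one way to obtain a buffer that is 512-byte aligned but deliberately not page aligned. The helper name alloc_misaligned() is hypothetical and this is not code taken from dma_thread; it only illustrates the kind of buffer the test exercises.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from dma_thread): return a pointer that is
 * aligned to `align` bytes but NOT to the page size. The page-aligned
 * base allocation is stored in *base_out so the caller can free() it.
 */
static void *alloc_misaligned(size_t len, size_t align, void **base_out)
{
    long page_l = sysconf(_SC_PAGESIZE);
    size_t page, total;
    void *base;

    if (page_l <= 0)
        return NULL;
    page = (size_t)page_l;

    /* align must be a nonzero proper divisor of the page size */
    if (align == 0 || align >= page || page % align != 0)
        return NULL;

    /* aligned_alloc() wants a size that is a multiple of the alignment */
    total = ((len + align + page - 1) / page) * page;
    base = aligned_alloc(page, total);
    if (base == NULL)
        return NULL;

    *base_out = base;
    /* page-aligned base plus `align` bytes: aligned to `align` only */
    return (char *)base + align;
}
```

A caller would pass the returned pointer to read(2) on a descriptor opened with O_DIRECT and later free() the base pointer, not the offset one.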

==============

Here is a snapshot of the call trace:

dma_thread    S 00000035  2872 20296   8797         23256       (NOTLB)
       f7018e78 00000046 1f593e7d 00000035 f7018e84 00000002 00000000 00000006 
       f4c35530 f71ac030 1f5a03da 00000035 0000c55d 00000001 f4c3563c c1a80044 
       f7018f04 f7018f1c b7e4cbd8 00000046 00000000 00000002 00000001 7fffffff 
Call Trace:
 [<c061bd10>] schedule_timeout+0x13/0x8c
 [<c043c435>] do_futex+0x1e2/0xb38
 [<c061d316>] _spin_unlock+0x14/0x1c
 [<c0465938>] do_wp_page+0x3fb/0x405
 [<c0466da0>] __handle_mm_fault+0x858/0x8b8
 [<c041e5f3>] default_wake_function+0x0/0xc
 [<c044e32b>] audit_syscall_entry+0x14b/0x17d
 [<c043ce9c>] sys_futex+0x111/0x127
 [<c0408076>] do_syscall_trace+0xab/0xb1
 [<c0404f53>] syscall_call+0x7/0xb
 =======================
dma_thread    S 00000035  3304 23256   8797 23258         20296 (NOTLB)
       f4e24f50 00000046 1ec0e26d 00000035 c073ea10 416db065 00000046 00000003 
       f70ac030 c1b7eab0 1ec7d7c9 00000035 0006f55c 00000001 f70ac13c c1a80044 
       00005ada f4cc0030 00000000 00000246 ffffffff 00000000 00000000 f53acab0 
Call Trace:
 [<c0426b84>] do_wait+0x8b5/0x9a3
 [<c044e32b>] audit_syscall_entry+0x14b/0x17d
 [<c041e5f3>] default_wake_function+0x0/0xc
 [<c0426c99>] sys_wait4+0x27/0x2a
 [<c0426caf>] sys_waitpid+0x13/0x17
 [<c0404f53>] syscall_call+0x7/0xb
 =======================
dma_thread    R running  3412 23258  23256                     (NOTLB)
...
...
Showing all locks held in the system:
4 locks held by kseriod/82:
 #0:  (serio_mutex){--..}, at: [<c059c7f6>] serio_thread+0x13/0x28d
 #1:  (&serio->drv_mutex){--..}, at: [<c059be7e>] serio_connect_driver+0x16/0x2c
 #2:  (psmouse_mutex){--..}, at: [<c05a41ce>] psmouse_connect+0x18/0x211
 #3:  (&ps2_mutex_key){--..}, at: [<c059e4bd>] ps2_command+0x80/0x2dc

=============================================

> BTW, the patch works fine on IA64.
> 
>> Hmm, but, fork() gets slower. 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-18 15:29     ` Andrea Arcangeli
@ 2008-12-19 11:51         ` Li Zefan
  2008-12-19  6:34       ` KOSAKI Motohiro
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Li Zefan @ 2008-12-19 11:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, Wang Chen

> diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> @@ -368,7 +368,7 @@
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, oldmm, mpnt);
> +		retval = copy_page_range(mm, oldmm, tmp);
>  

Could you explain a bit why this change is needed?

Seems this is a revert of the following commit:

commit 0b0db14c536debd92328819fe6c51a49717e8440
Author: Hugh Dickins <hugh@veritas.com>
Date:   Mon Nov 21 21:32:20 2005 -0800

    [PATCH] unpaged: copy_page_range vma

    For copy_one_pte's print_bad_pte to show the task correctly (instead of
    "???"), dup_mmap must pass down parent vma rather than child vma.

    Signed-off-by: Hugh Dickins <hugh@veritas.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/kernel/fork.c b/kernel/fork.c
index e0d0b77..1c1cf8d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -263,7 +263,7 @@ static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
                rb_parent = &tmp->vm_rb;

                mm->map_count++;
-               retval = copy_page_range(mm, oldmm, tmp);
+               retval = copy_page_range(mm, oldmm, mpnt);

                if (tmp->vm_ops && tmp->vm_ops->open)
                        tmp->vm_ops->open(tmp);


>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19 11:51         ` Li Zefan
@ 2008-12-19 12:14           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 32+ messages in thread
From: KOSAKI Motohiro @ 2008-12-19 12:14 UTC (permalink / raw)
  To: Li Zefan
  Cc: kosaki.motohiro, Andrea Arcangeli, Nick Piggin, Tim LaBerge,
	linux-mm, linux-fsdevel, Wang Chen

> > diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> > --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> > +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> > @@ -368,7 +368,7 @@
> >  		rb_parent = &tmp->vm_rb;
> >  
> >  		mm->map_count++;
> > -		retval = copy_page_range(mm, oldmm, mpnt);
> > +		retval = copy_page_range(mm, oldmm, tmp);
> >  
> 
> Could you explain a bit why this change is needed?

Maybe:

__handle_mm_fault() changes the rmap of the vma that is passed in.
We need the parent process to keep the original page and the child process
to get the new page, so we need the child vma.


> Seems this is a revert of the following commit:
> 
> commit 0b0db14c536debd92328819fe6c51a49717e8440
> Author: Hugh Dickins <hugh@veritas.com>
> Date:   Mon Nov 21 21:32:20 2005 -0800
> 
>     [PATCH] unpaged: copy_page_range vma
> 
>     For copy_one_pte's print_bad_pte to show the task correctly (instead of
>     "???"), dup_mmap must pass down parent vma rather than child vma.

I think you are right.
This patch reintroduces the same problem.

In the end, print_bad_pte() needs the parent vma and
__handle_mm_fault() needs the child vma.

Correct?




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19 11:51         ` Li Zefan
  (?)
  (?)
@ 2008-12-19 12:58         ` Hugh Dickins
  -1 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2008-12-19 12:58 UTC (permalink / raw)
  To: Li Zefan
  Cc: Andrea Arcangeli, Nick Piggin, Tim LaBerge, linux-mm,
	linux-fsdevel, Wang Chen

On Fri, 19 Dec 2008, Li Zefan wrote:
> > diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> > --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> > +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> > @@ -368,7 +368,7 @@
> >  		rb_parent = &tmp->vm_rb;
> >  
> >  		mm->map_count++;
> > -		retval = copy_page_range(mm, oldmm, mpnt);
> > +		retval = copy_page_range(mm, oldmm, tmp);
> >  
> 
> Could you explain a bit why this change is needed?
> 
> Seems this is a revert of the following commit:
> 
> commit 0b0db14c536debd92328819fe6c51a49717e8440
> Author: Hugh Dickins <hugh@veritas.com>
> Date:   Mon Nov 21 21:32:20 2005 -0800
> 
>     [PATCH] unpaged: copy_page_range vma
> 
>     For copy_one_pte's print_bad_pte to show the task correctly (instead of
>     "???"), dup_mmap must pass down parent vma rather than child vma.
> 
>     Signed-off-by: Hugh Dickins <hugh@veritas.com>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e0d0b77..1c1cf8d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -263,7 +263,7 @@ static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>                 rb_parent = &tmp->vm_rb;
> 
>                 mm->map_count++;
> -               retval = copy_page_range(mm, oldmm, tmp);
> +               retval = copy_page_range(mm, oldmm, mpnt);
> 
>                 if (tmp->vm_ops && tmp->vm_ops->open)
>                         tmp->vm_ops->open(tmp);
> 
> 
> >  		if (tmp->vm_ops && tmp->vm_ops->open)
> >  			tmp->vm_ops->open(tmp);


[I'm not finding much time to think about anything at the moment, so
reluctant even to stick my head above the parapet; but this is easy,
and though there might be lots of things I'd dislike about Andrea's
patch if I had time to study it ;-), this certainly isn't one of them.]

This should be a non-issue: although the patch that this reverts was
valid in itself, it arose from my misunderstanding (of the likely
relevance of current->comm in exit_mmap - much more likely to be
relevant than I was thinking at the time) that I forced upon Nick
in print_bad_pte().

And now I've a rewrite of print_bad_pte() queued up in -mm, which
admits that misunderstanding and removes the "???" case: so in 2.6.29
it shouldn't matter if we pass parent or child vma to copy_page_range.
Oh, and it doesn't even matter in 2.6.26 onwards either: they don't
have any calls to print_bad_pte() below copy_page_range().

Though I've not checked whether we might have added some other
dependence on it being parent vma in the meanwhile - that's possible.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:44         ` Li Zefan
@ 2008-12-19 20:27             ` Andrea Arcangeli
  2008-12-19 20:27             ` Andrea Arcangeli
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-12-19 20:27 UTC (permalink / raw)
  To: Li Zefan
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Tim LaBerge, linux-mm,
	linux-fsdevel, FNST-Wang Chen

Hello,

On Fri, Dec 19, 2008 at 03:44:09PM +0800, Li Zefan wrote:
> Tim LaBerge, though the program can pass but sometimes hanged. strace log is
> attached, and we'll test it again with LOCKDEP enabled to see if we can get
> some other information.

So my current suggestion is to understand why __reclaim_stacks does not
start with an lll_unlock before the list_for_each runs. I'll look into
this next week if nobody has explained it by then ;). Statistically
speaking it's more likely that the kernel patch is buggy, and I know this
is likely a faulty theory, but it's not impossible that this is an
unrelated bug that stayed hidden because it required the userland
list_del/add/splice to race against the single ptep_set_wrprotect
instruction in the kernel.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19 11:51         ` Li Zefan
@ 2008-12-19 20:34           ` Andrea Arcangeli
  -1 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-12-19 20:34 UTC (permalink / raw)
  To: Li Zefan; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, Wang Chen

On Fri, Dec 19, 2008 at 07:51:49PM +0800, Li Zefan wrote:
> > diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> > --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> > +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> > @@ -368,7 +368,7 @@
> >  		rb_parent = &tmp->vm_rb;
> >  
> >  		mm->map_count++;
> > -		retval = copy_page_range(mm, oldmm, mpnt);
> > +		retval = copy_page_range(mm, oldmm, tmp);
> >  
> 
> Could you explain a bit why this change is needed?

This change is needed to pass the child vma (not the parent vma) to
handle_mm_fault. We run handle_mm_fault on the child not on the
parent, so the vma passed to handle_mm_fault has to be the one of the
child obviously. It won't make a difference for the other users of the
vma because both vma are basically the same. Nick did it btw.
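
To make the idea from the patch description concrete, here is a userspace toy model of the fork-time decision: if a page may have direct I/O in flight (modelled as more references than page-table mappings), the parent pte is left writable and the child gets its own copy at fork time; otherwise the usual write-protect-and-share COW setup is used. All names (toy_page, toy_pte, toy_fork_one_pte) are hypothetical. This is a simplified sketch and not the kernel code from the patch, which goes through handle_mm_fault on the child vma and must also deal with locking and the swap cache.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for a page: not the kernel's struct page. */
struct toy_page {
    int refcount;   /* all references, including pins held by in-flight I/O */
    int mapcount;   /* references that come from page tables only */
    unsigned char data[TOY_PAGE_SIZE];
};

/* Toy stand-in for a page table entry. */
struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/* More references than mappings: something like get_user_pages may hold it. */
static bool toy_page_maybe_pinned(const struct toy_page *p)
{
    return p->refcount > p->mapcount;
}

/*
 * Set up the child's pte for one anonymous page at fork time.
 * `spare` is a preallocated page used if a copy is needed.
 * Returns true if the child received a private copy immediately.
 */
static bool toy_fork_one_pte(struct toy_pte *parent, struct toy_pte *child,
                             struct toy_page *spare)
{
    if (toy_page_maybe_pinned(parent->page)) {
        /* Copy now; the parent pte is never write-protected. */
        memcpy(spare->data, parent->page->data, TOY_PAGE_SIZE);
        spare->refcount = 1;
        spare->mapcount = 1;
        child->page = spare;
        child->writable = true;
        return true;
    }

    /* Ordinary COW: share the page and write-protect both sides. */
    parent->writable = false;
    parent->page->refcount++;
    parent->page->mapcount++;
    child->page = parent->page;
    child->writable = false;
    return false;
}
```

In the pinned case the parent keeps writing to, and receiving I/O in, the same physical page before and after the fork, which is the property the fix relies on.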

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:19       ` KAMEZAWA Hiroyuki
@ 2008-12-20 15:55           ` Andrea Arcangeli
  2008-12-20 15:55           ` Andrea Arcangeli
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-12-20 15:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Fri, Dec 19, 2008 at 04:19:11PM +0900, KAMEZAWA Hiroyuki wrote:
> Result of cost-of-fork() on ia64.
> ==
>   size of memory  before  after
>   Anon=1M   	, 0.07ms, 0.08ms
>   Anon=10M  	, 0.17ms, 0.22ms
>   Anon=100M 	, 1.15ms, 1.64ms
>   Anon=1000M	, 11.5ms, 15.821ms
> ==
> 
> fork() cost is 135% when the process has 1G of Anon.

Not sure where the 135% number comes from. The above numbers show a
performance decrease of 27% or a time increase of 37%, which I hope is
in line with the overhead introduced by the TestSetPageLocked in the
fast path (which I didn't expect to be so bad), but it's almost
trivial to eliminate with an smp_wmb in add_to_swap_cache and an smp_rmb
in fork. So we'll need to repeat this measurement after replacing the
TestSetPageLocked with smp_rmb.

^ permalink raw reply	[flat|nested] 32+ messages in thread
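
The two percentages in the reply above can be recomputed from the
Anon=1000M row of the quoted table. The following is only a minimal
sketch of that arithmetic; the function names are made up for
illustration and do not come from the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Relative increase in elapsed time per fork(), in percent. */
double time_increase_pct(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (after_ms / before_ms - 1.0) * 100.0;
}

/* Relative drop in throughput (forks per unit of time), in percent. */
double throughput_decrease_pct(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return (1.0 - before_ms / after_ms) * 100.0;
}

/* Print both figures for one row of a before/after table. */
void print_fork_cost(const char *label, double before_ms, double after_ms)
{
    printf("%s: time +%.1f%%, throughput -%.1f%%\n", label,
           time_increase_pct(before_ms, after_ms),
           throughput_decrease_pct(before_ms, after_ms));
}
```

For the 1000M row (11.5 ms before, 15.821 ms after) this gives a time
increase of about 37.6% and a throughput decrease of about 27.3%,
which matches the 37% and 27% quoted in the reply.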

* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  6:34       ` KOSAKI Motohiro
@ 2008-12-20 16:02           ` Andrea Arcangeli
  0 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-12-20 16:02 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

Hello!

On Fri, Dec 19, 2008 at 03:34:20PM +0900, KOSAKI Motohiro wrote:
> I think gup_pte_range() doesn't change pte attribute.
> Could you explain why get_user_pages_fast() is evil?

It's evil because it was assumed that just relying on
local_irq_disable() to prevent the smp tlb flush IPI from running would
be enough to simulate a 'current' pagetable walk, allowing the current
task to run entirely lockless.

The problem is that being totally lockless prevents us from knowing
whether a page is under direct-io or not. And if a page is under direct
IO that writes to memory (reading from memory we couldn't care less
about, it's always ok) we can't merge pages in ksm and we can't mark
the pte readonly in fork etc... If we do, things break. The entirely
lockless (but atomic) pagetable walk done by the cpu is different from
gup_fast because the one done by the cpu will never end up writing to
the page through the pci bus in DMA, so the moment the IPI runs
whatever I/O there was is interrupted (not the case for gup_fast: when
gup_fast returns and the IPI runs and the page is then available for
sharing to ksm or the pte is marked readonly, the direct DMA is still
in flight). That's why gup_fast *can't* be 100% lockless as it is
today, otherwise it's unfixable and broken, and it's not just ksm.
This very O_DIRECT bug in fork is 100% unfixable without adding some
serialization to gup_fast. So my patch fixes it fully only for kernels
before the introduction of gup_fast...

My suggestion is to reintroduce the big reader lock (br_lock) of
2.4 and have gup_fast take the read side of it, and fork/ksm take the
write side. It must not be a write-starving lock like the 2.4 one
though, or fork would hang forever on large smp. It should still be
faster than get_user_pages.

> Why rhel can't use memory barrier?

Oh it can, I just didn't implement it immediately as I wanted to ship a
simpler patch first, but given the 27% slowdown measured in a later
email, I'll definitely have to replace the TestSetPageLocked with
smp_rmb and see if the introduced overhead goes away.

^ permalink raw reply	[flat|nested] 32+ messages in thread
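
The reader/writer scheme suggested above is a kernel-side change.
Purely as a userspace analogy, and not the kernel's br_lock, the same
idea can be sketched with a pthread rwlock: threads doing O_DIRECT
reads take the read side, and the thread calling fork() takes the
write side, so no read should be in flight while the page tables are
copied. All names below (dio_fork_lock, dio_fork_lock_init, dio_read,
dio_safe_fork) are hypothetical, and the writer-preferring lock kind
is a glibc extension assumed to be available.

```c
#define _GNU_SOURCE 1   /* for pthread_rwlockattr_setkind_np() */

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static pthread_rwlock_t dio_fork_lock;

/* Initialise the lock so that a waiting writer (fork) is not starved. */
int dio_fork_lock_init(void)
{
    pthread_rwlockattr_t attr;
    int rc = pthread_rwlockattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_rwlockattr_setkind_np(&attr,
            PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    if (rc == 0)
        rc = pthread_rwlock_init(&dio_fork_lock, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}

/* read(2) wrapper: many readers may run in parallel, but never
 * concurrently with dio_safe_fork(). */
ssize_t dio_read(int fd, void *buf, size_t len)
{
    ssize_t n;
    int saved_errno, rc;

    rc = pthread_rwlock_rdlock(&dio_fork_lock);
    assert(rc == 0);
    n = read(fd, buf, len);
    saved_errno = errno;
    rc = pthread_rwlock_unlock(&dio_fork_lock);
    assert(rc == 0);
    errno = saved_errno;
    return n;
}

/* fork(2) wrapper: waits until all dio_read() calls have returned.
 * The child inherits a copy of the held lock and must not use it;
 * it is expected to exec or _exit promptly. */
pid_t dio_safe_fork(void)
{
    pid_t pid;
    int rc;

    rc = pthread_rwlock_wrlock(&dio_fork_lock);
    assert(rc == 0);
    pid = fork();
    if (pid != 0) {     /* parent, or fork() failed */
        int saved_errno = errno;

        rc = pthread_rwlock_unlock(&dio_fork_lock);
        assert(rc == 0);
        errno = saved_errno;
    }
    return pid;
}
```

This only covers synchronous read(2); asynchronous I/O would need the
read side to be held until completion, which this sketch does not
attempt.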

* Re: Corruption with O_DIRECT and unaligned user buffers
  2014-07-01  4:18 ` Hugh Dickins
@ 2014-07-02 11:39   ` Xiaoguang Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Xiaoguang Wang @ 2014-07-02 11:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, akpm, mgorman, Andrea Arcangeli, chrubis

Hi,

On 07/01/2014 12:18 PM, Hugh Dickins wrote:
> On Fri, 27 Jun 2014, Xiaoguang Wang wrote:
>> Hi maintainers,
> 
> That's not me, but I'll answer with my opinion.

Sure, thanks. Any opinions or suggestions will be appreciated :)
> 
>>
>> In August 2008, there was a discussion about 'Corruption with O_DIRECT and unaligned user buffers',
>> please have a look at this url: http://thread.gmane.org/gmane.linux.file-systems/27358
> 
> Whereas (now the truth can be told!) "someone wishing to remain anonymous"
> in that thread was indeed me.  Then as now, disinclined to spend time on it.
> 
>>
>> The attached test program written by Tim has been added to LTP, please see this below url:
>> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/io/direct_io/dma_thread_diotest.c
>>
>>
>> I have now tested this program on kernel 3.16.0-rc1+, and it seems that the data corruption still exists. Meanwhile
>> there is also such a section in open(2)'s manpage warning that O_DIRECT I/Os should never be run
>> concurrently with the fork(2) system call. Please see below section:
>>
>>     O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer
>>     is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes
>>     memory allocated on the heap and statically allocated buffers).  Any such I/Os, whether  submitted
>>     via an asynchronous I/O interface or from another thread in the process, should be completed before
>>     fork(2) is called.  Failure to do so can result in data corruption and undefined behavior in parent
>>     and child processes.  This restriction does not apply when the memory buffer for  the  O_DIRECT
>>     I/Os  was  created  using shmat(2) or mmap(2) with the MAP_SHARED flag.  Nor does this restriction
>>     apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will
>>     not be available to the child after fork(2).
>>
>> Hmm, so I'd like to know whether you have plans to fix this bug, or whether it is not considered a
>> bug but just a programming rule: avoid calling fork() while O_DIRECT file operations with
>> non-page-aligned I/O are in flight. Thanks.
>>
>> Steps to run the attached program:
>> 1. ./dma_thread  # create temp files
>> 2. ./dma_thread -a 512 -w 8 # alignment is 512, create 8 threads
> 
> I regard it, then and now, as a displeasing limitation;
> but one whose fix would cause more trouble than it's worth.

Yeah, I see. Once Andrea had a patch to fix this, but it would slow down fork().
> 
> I thought we settled long ago on MADV_DONTFORK as an imperfect but
> good enough workaround.  Not everyone will agree.  I certainly have
> no plans to go further myself.
OK, thanks for your response all the same.

I don't have much knowledge of mm at the moment, sorry, so I'd like to know whether anyone
else has an opinion on this issue or a plan to fix it. Thanks.

Regards,
Xiaoguang Wang

> 
> Hugh
> .
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Corruption with O_DIRECT and unaligned user buffers
  2014-06-27  2:08 Xiaoguang Wang
@ 2014-07-01  4:18 ` Hugh Dickins
  2014-07-02 11:39   ` Xiaoguang Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Hugh Dickins @ 2014-07-01  4:18 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: linux-mm, akpm, mgorman, Andrea Arcangeli, Hugh Dickins, chrubis

On Fri, 27 Jun 2014, Xiaoguang Wang wrote:
> Hi maintainers,

That's not me, but I'll answer with my opinion.

> 
> In August 2008, there was a discussion about 'Corruption with O_DIRECT and unaligned user buffers',
> please have a look at this url: http://thread.gmane.org/gmane.linux.file-systems/27358

Whereas (now the truth can be told!) "someone wishing to remain anonymous"
in that thread was indeed me.  Then as now, disinclined to spend time on it.

> 
> The attached test program written by Tim has been added to LTP, please see this below url:
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/io/direct_io/dma_thread_diotest.c
> 
> 
> I have now tested this program on kernel 3.16.0-rc1+, and it seems that the data corruption still exists. Meanwhile
> there is also such a section in open(2)'s manpage warning that O_DIRECT I/Os should never be run
> concurrently with the fork(2) system call. Please see below section:
> 
>     O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer
>     is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes
>     memory allocated on the heap and statically allocated buffers).  Any such I/Os, whether  submitted
>     via an asynchronous I/O interface or from another thread in the process, should be completed before
>     fork(2) is called.  Failure to do so can result in data corruption and undefined behavior in parent
>     and child processes.  This restriction does not apply when the memory buffer for  the  O_DIRECT
>     I/Os  was  created  using shmat(2) or mmap(2) with the MAP_SHARED flag.  Nor does this restriction
>     apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will
>     not be available to the child after fork(2).
> 
> Hmm, so I'd like to know whether you have plans to fix this bug, or whether it is not considered a
> bug but just a programming rule: avoid calling fork() while O_DIRECT file operations with
> non-page-aligned I/O are in flight. Thanks.
> 
> Steps to run the attached program:
> 1. ./dma_thread  # create temp files
> 2. ./dma_thread -a 512 -w 8 # alignment is 512, create 8 threads

I regard it, then and now, as a displeasing limitation;
but one whose fix would cause more trouble than it's worth.

I thought we settled long ago on MADV_DONTFORK as an imperfect but
good enough workaround.  Not everyone will agree.  I certainly have
no plans to go further myself.

Hugh

^ permalink raw reply	[flat|nested] 32+ messages in thread
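
The MADV_DONTFORK workaround mentioned in the reply above, and in the
quoted open(2) text, amounts to marking the O_DIRECT buffer so that
fork(2) does not map it into the child at all. The following is a
minimal sketch under stated assumptions: the helper names
(round_up_to, alloc_dio_buffer, free_dio_buffer) are hypothetical, the
buffer is an anonymous private mapping, and the child never needs to
touch the buffer.

```c
#define _GNU_SOURCE 1   /* for MADV_DONTFORK and MAP_ANONYMOUS */

#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a multiple of page (page must be non-zero). */
size_t round_up_to(size_t len, size_t page)
{
    assert(page > 0);
    return (len + page - 1) / page * page;
}

/* Allocate a page-aligned buffer that will not be present in any
 * child created by fork(2). Returns NULL on failure. */
void *alloc_dio_buffer(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t size;
    void *buf;

    if (page <= 0 || len == 0)
        return NULL;

    size = round_up_to(len, (size_t)page);
    buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* madvise(2) needs a page-aligned start address; mmap provides it. */
    if (madvise(buf, size, MADV_DONTFORK) != 0) {
        munmap(buf, size);
        return NULL;
    }
    return buf;
}

/* Release a buffer obtained from alloc_dio_buffer() with the same len. */
void free_dio_buffer(void *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    if (buf != NULL && page > 0)
        munmap(buf, round_up_to(len, (size_t)page));
}
```

In the attached test program, such a buffer would stand in for the one
obtained from posix_memalign(); the unaligned offset is then applied
inside the advised region, as the program already does with
buffer + align. Any access to the buffer from a child process would
fault, which is the "imperfect" part of the workaround.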

* Corruption with O_DIRECT and unaligned user buffers
@ 2014-06-27  2:08 Xiaoguang Wang
  2014-07-01  4:18 ` Hugh Dickins
  0 siblings, 1 reply; 32+ messages in thread
From: Xiaoguang Wang @ 2014-06-27  2:08 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, mgorman, Andrea Arcangeli, Hugh Dickins, chrubis

[-- Attachment #1: Type: text/plain, Size: 1975 bytes --]

Hi maintainers,

In August 2008, there was a discussion about 'Corruption with O_DIRECT and unaligned user buffers',
please have a look at this url: http://thread.gmane.org/gmane.linux.file-systems/27358

The attached test program written by Tim has been added to LTP, please see this below url:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/io/direct_io/dma_thread_diotest.c


I have now tested this program on kernel 3.16.0-rc1+, and it seems that the data corruption still exists. Meanwhile
there is also such a section in open(2)'s manpage warning that O_DIRECT I/Os should never be run
concurrently with the fork(2) system call. Please see below section:

    O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer
    is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes
    memory allocated on the heap and statically allocated buffers).  Any such I/Os, whether  submitted
    via an asynchronous I/O interface or from another thread in the process, should be completed before
    fork(2) is called.  Failure to do so can result in data corruption and undefined behavior in parent
    and child processes.  This restriction does not apply when the memory buffer for  the  O_DIRECT
    I/Os  was  created  using shmat(2) or mmap(2) with the MAP_SHARED flag.  Nor does this restriction
    apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will
    not be available to the child after fork(2).

Hmm, so I'd like to know whether you have plans to fix this bug, or whether it is not considered a
bug but just a programming rule: avoid calling fork() while O_DIRECT file operations with
non-page-aligned I/O are in flight. Thanks.

Steps to run the attached program:
1. ./dma_thread  # create temp files
2. ./dma_thread -a 512 -w 8 # alignment is 512, create 8 threads


Regards,
Xiaoguang Wang

[-- Attachment #2: dma_thread.c --]
[-- Type: text/x-csrc, Size: 6582 bytes --]

/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>	/* memset(), strerror() */
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024) 
#define READSIZE  (1024*1024)

#define FILENAME    "test_%.04d.tmp"
#define FILECOUNT   100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE   4096

#define true	1
#define false	0

typedef int bool;

volatile bool	done	= false;	/* set by main(), polled by fork_thread() */
int	workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
    fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [-w <workers>]]\n"
		    "\nWith no arguments, generate test files and exit.\n"
		    "-h Display this help and exit.\n"
		    "-a align read buffer to offset <alignment>.\n"
		    "-w number of worker threads, 2 to 256,\n"
		    "   defaults to number of online cpus.\n\n"

		    "Run first with no arguments to generate files.\n"
		    "Then run with -a <alignment> = 512  or 0. \n");
}

typedef struct {
    pthread_t	    tid;
    int		    worker_number;
    int		    fd;
    int		    offset;
    int		    length;
    int		    pattern;
    unsigned char  *buffer;
} worker_t;


void *worker_thread(void * arg)
{
    int		    bytes_read;
    int		    i,k;
    worker_t	   *worker  = (worker_t *) arg;
    int		    offset  = worker->offset;
    int		    fd	    = worker->fd;
    unsigned char  *buffer  = worker->buffer;
    int		    pattern = worker->pattern;
    int		    length  = worker->length;
    
    if (lseek(fd, offset, SEEK_SET) < 0) {
	fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n", 
			offset, fd, strerror(errno));
	exit(1);
    }

    bytes_read = read(fd, buffer, length);
    if (bytes_read != length) {
	fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n", 
			fd, bytes_read, strerror(errno));
	exit(1);
    }

    /* Corruption check */
    for (i = 0; i < length; i++) {
	if (buffer[i] != pattern) {
	    /* Start the dump up to 8 bytes early, but never before the buffer. */
	    int start = (i >= 8) ? i - 8 : 0;

	    printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
	    printf("Data dump starting at 0x%.06x:\n", start);
	    printf("Expect 0x%x followed by 0x%x:\n",
		    pattern, PATTERN);

	    for (k = 0; k < 16; k++) {
		printf("%02x ", buffer[start + k]);
		if (k == 7) {
		    printf("\n");
		}
	    }

	    printf("\n");
	    abort();
	}
    }

    return 0;
}

void *fork_thread (void *arg) 
{
    pid_t pid;

    while (!done) {
	pid = fork();
	if (pid == 0) {
	    exit(0);
	} else if (pid < 0) {
	    fprintf(stderr, "Failed to fork child.\n");
	    exit(1);
	} 
	waitpid(pid, NULL, 0 );
	usleep(100);
    }

    return NULL;

}

int main(int argc, char *argv[])
{
    unsigned char  *buffer = NULL;
    char	    filename[1024];
    int		    fd;
    bool	    dowrite = true;
    pthread_t	    fork_tid;
    int		    c, n, j;
    worker_t	   *worker;
    int		    align = 0;
    int		    offset, rc;

    workers = sysconf(_SC_NPROCESSORS_ONLN);

    while ((c = getopt(argc, argv, "a:hw:")) != -1) {
	switch (c) {
	case 'a':
	    align = atoi(optarg);
	    if (align < 0 || align > PAGE_SIZE) {
		printf("Bad alignment %d.\n", align);
		exit(1);
	    }
	    dowrite = false;
	    break;

	case 'h':
	    usage();
	    exit(0);
	    break;

	case 'w':
	    workers = atoi(optarg);
	    if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
		fprintf(stderr, "Worker count %d not between "
				"%d and %d, inclusive.\n",
				workers, MIN_WORKERS, MAX_WORKERS);
		usage();
		exit(1);
	    }
	    dowrite = false;
	    break;

	default:
	    usage();
	    exit(1);
	}
    }

    if (argc > 1 && (optind < argc)) {
	fprintf(stderr, "Bad command line.\n");
	usage();
	exit(1);
    }

    if (dowrite) {

	buffer = malloc(FILESIZE);
	if (buffer == NULL) {
	    fprintf(stderr, "Failed to malloc write buffer.\n");
	    exit(1);
	}

	for (n = 1; n <= FILECOUNT; n++) {
	    sprintf(filename, FILENAME, n);
	    fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
	    if (fd < 0) {
		printf("create failed(%s): %s.\n", filename, strerror(errno));
		exit(1);
	    }
	    memset(buffer, n, FILESIZE);
	    printf("Writing file %s.\n", filename);
	    if (write(fd, buffer, FILESIZE) != FILESIZE) {
		/* A short or failed write leaves an unusable test file. */
		printf("write failed (%s)\n", filename);
		exit(1);
	    }

	    close(fd);
	    fd = -1;
	}

	free(buffer);
	buffer = NULL;

	printf("done\n");
	exit(0);
    }

    printf("Using %d workers.\n", workers);

    worker = malloc(workers * sizeof(worker_t));
    if (worker == NULL) {
	fprintf(stderr, "Failed to malloc worker array.\n");
	exit(1);
    }

    for (j = 0; j < workers; j++) {
	worker[j].worker_number = j;
    }

    printf("Using alignment %d.\n", align);
    
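    /* posix_memalign() returns an error number and leaves errno alone. */
    rc = posix_memalign((void **)&buffer, PAGE_SIZE, READSIZE + align);
    if (rc != 0) {
	fprintf(stderr, "Failed to allocate read buffer: %s.\n",
			strerror(rc));
	exit(1);
    }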
    printf("Read buffer: %p.\n", buffer);
    for (n = 1; n <= FILECOUNT; n++) {

	sprintf(filename, FILENAME, n);
	for (j = 0; j < workers; j++) {
	    if ((worker[j].fd = open(filename,  O_RDONLY|O_DIRECT)) < 0) {
		fprintf(stderr, "Failed to open %s: %s.\n",
				filename, strerror(errno));
		exit(1);
	    }

	    worker[j].pattern = n;
	}

	printf("Reading file %d.\n", n);

	for (offset = 0; offset < FILESIZE; offset += READSIZE) {
	    memset(buffer, PATTERN, READSIZE + align);
	    for (j = 0; j < workers; j++) {
		worker[j].offset = offset + j * PAGE_SIZE;
		worker[j].buffer = buffer + align + j * PAGE_SIZE;
		worker[j].length = PAGE_SIZE;
	    }
	    /* The final worker reads whatever is left over. */
	    worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

	    done = 0;

	    rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Can't create fork thread: %s.\n", 
				strerror(rc));
		exit(1);
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_create(&worker[j].tid, 
				    NULL, 
				    worker_thread, 
				    worker + j);
		if (rc != 0) {
		    fprintf(stderr, "Can't create worker thread %d: %s.\n", 
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_join(worker[j].tid, NULL);
		if (rc != 0) {
		    fprintf(stderr, "Failed to join worker thread %d: %s.\n",
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    /* Let the fork thread know it's ok to exit */
	    done = 1;

	    rc = pthread_join(fork_tid, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Failed to join fork thread: %s.\n",
				strerror(rc));
		exit(1);
	    }
	}

	/* Close the fd's for the next file. */
	for (j = 0; j < workers; j++) {
	    close(worker[j].fd);
	}
    }

    return 0;
}

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-07-02 11:42 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-14 17:04 Corruption with O_DIRECT and unaligned user buffers Tim LaBerge
2008-11-19  4:25 ` Nick Piggin
2008-11-19  4:25   ` Nick Piggin
2008-11-19  6:52   ` Nick Piggin
2008-11-19  6:52     ` Nick Piggin
2008-11-19 16:58   ` Andrea Arcangeli
2008-11-19 16:58     ` Andrea Arcangeli
2008-12-18 15:29     ` Andrea Arcangeli
2008-12-19  2:21       ` KAMEZAWA Hiroyuki
2008-12-19  5:06         ` KAMEZAWA Hiroyuki
2008-12-19  5:06           ` KAMEZAWA Hiroyuki
2008-12-19  6:34       ` KOSAKI Motohiro
2008-12-20 16:02         ` Andrea Arcangeli
2008-12-20 16:02           ` Andrea Arcangeli
2008-12-19  7:19       ` KAMEZAWA Hiroyuki
2008-12-19  7:44         ` Li Zefan
2008-12-19  8:45           ` Li Zefan
2008-12-19  8:45             ` Li Zefan
2008-12-19 20:27           ` Andrea Arcangeli
2008-12-19 20:27             ` Andrea Arcangeli
2008-12-20 15:55         ` Andrea Arcangeli
2008-12-20 15:55           ` Andrea Arcangeli
2008-12-19 11:51       ` Li Zefan
2008-12-19 11:51         ` Li Zefan
2008-12-19 12:14         ` KOSAKI Motohiro
2008-12-19 12:14           ` KOSAKI Motohiro
2008-12-19 12:58         ` Hugh Dickins
2008-12-19 20:34         ` Andrea Arcangeli
2008-12-19 20:34           ` Andrea Arcangeli
2014-06-27  2:08 Xiaoguang Wang
2014-07-01  4:18 ` Hugh Dickins
2014-07-02 11:39   ` Xiaoguang Wang