linux-kernel.vger.kernel.org archive mirror
* open(2) says O_DIRECT works on 512 byte boundaries?
@ 2009-01-28 21:33 Greg KH
  2009-01-29  0:41 ` Robert Hancock
  2009-01-29  5:13 ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 25+ messages in thread
From: Greg KH @ 2009-01-28 21:33 UTC (permalink / raw)
  To: mtk.manpages; +Cc: linux-man, linux-kernel

In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
with the 2.6 kernel release:
	Under Linux 2.4, transfer sizes, and the alignment of the user
	buffer and  the file offset must all be multiples of the logical
	block size of the file system.  Under Linux 2.6, alignment  to
	512-byte  boundaries suffices.

However if you try to access an O_DIRECT opened file with a buffer that
is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
read.)

Is this just a mistake in the documentation?  Or am I reading it
incorrectly?

I have a test program that shows this if anyone wants it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-28 21:33 open(2) says O_DIRECT works on 512 byte boundaries? Greg KH
@ 2009-01-29  0:41 ` Robert Hancock
       [not found]   ` <20090129011758.GA26534@kroah.com>
  2009-01-29  5:13 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 25+ messages in thread
From: Robert Hancock @ 2009-01-29  0:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-man, linux-kernel

Greg KH wrote:
> In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
> with the 2.6 kernel release:
> 	Under Linux 2.4, transfer sizes, and the alignment of the user
> 	buffer and  the file offset must all be multiples of the logical
> 	block size of the file system.  Under Linux 2.6, alignment  to
> 	512-byte  boundaries suffices.
> 
> However if you try to access an O_DIRECT opened file with a buffer that
> is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
> read.)
> 
> Is this just a mistake in the documentation?  Or am I reading it
> incorrectly?
> 
> I have a test program that shows this if anyone wants it.

Well, it sounds like a bug to me. Even if it's not supported, if you do 
such an access, surely the kernel should detect that and return EINVAL 
or something rather than reading corrupted data.



* Re: open(2) says O_DIRECT works on 512 byte boundaries?
       [not found]   ` <20090129011758.GA26534@kroah.com>
@ 2009-01-29  2:59     ` Michael Kerrisk
  2009-01-29  3:13       ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Kerrisk @ 2009-01-29  2:59 UTC (permalink / raw)
  To: Greg KH
  Cc: Robert Hancock, Greg KH, mtk.manpages, linux-man, linux-kernel




On Thu, Jan 29, 2009 at 2:17 PM, Greg KH <greg@kroah.com> wrote:
>
>
>
> On Wed, Jan 28, 2009 at 06:41:49PM -0600, Robert Hancock wrote:
>>
>>
>> Greg KH wrote:
>>> In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
>>> with the 2.6 kernel release:
>>>      Under Linux 2.4, transfer sizes, and the alignment of the user
>>>      buffer and  the file offset must all be multiples of the logical
>>>      block size of the file system.  Under Linux 2.6, alignment  to
>>>      512-byte  boundaries suffices.
>>> However if you try to access an O_DIRECT opened file with a buffer that
>>> is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
>>> read.)
>>> Is this just a mistake in the documentation?  Or am I reading it
>>> incorrectly?
>>> I have a test program that shows this if anyone wants it.
>>
>> Well, it sounds like a bug to me.. even if it's not supported, if you do
>> such an access, surely the kernel should detect that and return EINVAL or
>> something rather than reading corrupted data..
>
> It doesn't.  It says the read is successful, yet the data is not really
> read into the buffer.  Portions of it are, but not the amount we asked
> for.

Greg,

Can you post your test program?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html




* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-29  2:59     ` Michael Kerrisk
@ 2009-01-29  3:13       ` Greg KH
  2009-01-29 15:40         ` Jeff Moyer
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2009-01-29  3:13 UTC (permalink / raw)
  To: mtk.manpages; +Cc: Robert Hancock, linux-man, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1686 bytes --]

On Thu, Jan 29, 2009 at 03:59:12PM +1300, Michael Kerrisk wrote:
> On Thu, Jan 29, 2009 at 2:17 PM, Greg KH <greg@kroah.com> wrote:
> >
> >
> >
> > On Wed, Jan 28, 2009 at 06:41:49PM -0600, Robert Hancock wrote:
> >>
> >>
> >> Greg KH wrote:
> >>> In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
> >>> with the 2.6 kernel release:
> >>>      Under Linux 2.4, transfer sizes, and the alignment of the user
> >>>      buffer and  the file offset must all be multiples of the logical
> >>>      block size of the file system.  Under Linux 2.6, alignment  to
> >>>      512-byte  boundaries suffices.
> >>> However if you try to access an O_DIRECT opened file with a buffer that
> >>> is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
> >>> read.)
> >>> Is this just a mistake in the documentation?  Or am I reading it
> >>> incorrectly?
> >>> I have a test program that shows this if anyone wants it.
> >>
> >> Well, it sounds like a bug to me.. even if it's not supported, if you do
> >> such an access, surely the kernel should detect that and return EINVAL or
> >> something rather than reading corrupted data..
> >
> > It doesn't.  It says the read is successful, yet the data is not really
> > read into the buffer.  Portions of it are, but not the amount we asked
> > for.
> 
> Greg,
> 
> Can you post your test program?

Sure, here it is.  I'm still not quite sure it is valid, but at first
glance it seems to be.

Run it once with no arguments and all of the files will be created.
Then run it again with no offset being asked for:
	./dma_thread -a 0
then with an offset:
	./dma_thread -a 512

The second one breaks.

thanks,

greg k-h

[-- Attachment #2: dma_thread.c --]
[-- Type: text/x-csrc, Size: 6582 bytes --]

/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024) 
#define READSIZE  (1024*1024)

#define FILENAME    "test_%.04d.tmp"
#define FILECOUNT   100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE   4096

#define true	1
#define false	0

typedef int bool;

bool	done	= false;
int	workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
    fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [-w <workers>]]\n"
		    "\nWith no arguments, generate test files and exit.\n"
		    "-h Display this help and exit.\n"
		    "-a Align the read buffer to offset <alignment>.\n"
		    "-w Number of worker threads, between 2 and 256;\n"
		    "   defaults to the number of online cores.\n\n"

		    "Run first with no arguments to generate the files.\n"
		    "Then run with -a <alignment> = 512 or 0.\n");
}

typedef struct {
    pthread_t	    tid;
    int		    worker_number;
    int		    fd;
    int		    offset;
    int		    length;
    int		    pattern;
    unsigned char  *buffer;
} worker_t;


void *worker_thread(void * arg)
{
    int		    bytes_read;
    int		    i,k;
    worker_t	   *worker  = (worker_t *) arg;
    int		    offset  = worker->offset;
    int		    fd	    = worker->fd;
    unsigned char  *buffer  = worker->buffer;
    int		    pattern = worker->pattern;
    int		    length  = worker->length;
    
    if (lseek(fd, offset, SEEK_SET) < 0) {
	fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n", 
			offset, fd, strerror(errno));
	exit(1);
    }

    bytes_read = read(fd, buffer, length);
    if (bytes_read != length) {
	fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n", 
			fd, bytes_read, strerror(errno));
	exit(1);
    }

    /* Corruption check */
    for (i = 0; i < length; i++) {
	if (buffer[i] != pattern) {
	    int dump_start = (i >= 8) ? i - 8 : 0;

	    printf("Bad data at 0x%.06x: %p\n", i, buffer + i);
	    printf("Data dump starting at 0x%.06x:\n", dump_start);
	    printf("Expect 0x%x followed by 0x%x:\n",
		    pattern, PATTERN);

	    for (k = 0; k < 16; k++) {
		printf("%02x ", buffer[dump_start + k]);
		if (k == 7) {
		    printf("\n");
		}       
	    }

	    printf("\n");
	    abort();
	}
    }

    return 0;
}

void *fork_thread (void *arg) 
{
    pid_t pid;

    while (!done) {
	pid = fork();
	if (pid == 0) {
	    exit(0);
	} else if (pid < 0) {
	    fprintf(stderr, "Failed to fork child.\n");
	    exit(1);
	} 
	waitpid(pid, NULL, 0 );
	usleep(100);
    }

    return NULL;

}

int main(int argc, char *argv[])
{
    unsigned char  *buffer = NULL;
    char	    filename[1024];
    int		    fd;
    bool	    dowrite = true;
    pthread_t	    fork_tid;
    int		    c, n, j;
    worker_t	   *worker;
    int		    align = 0;
    int		    offset, rc;

    workers = sysconf(_SC_NPROCESSORS_ONLN);

    while ((c = getopt(argc, argv, "a:hw:")) != -1) {
	switch (c) {
	case 'a':
	    align = atoi(optarg);
	    if (align < 0 || align > PAGE_SIZE) {
		printf("Bad alignment %d.\n", align);
		exit(1);
	    }
	    dowrite = false;
	    break;

	case 'h':
	    usage();
	    exit(0);
	    break;

	case 'w':
	    workers = atoi(optarg);
	    if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
		fprintf(stderr, "Worker count %d not between "
				"%d and %d, inclusive.\n",
				workers, MIN_WORKERS, MAX_WORKERS);
		usage();
		exit(1);
	    }
	    dowrite = false;
	    break;

	default:
	    usage();
	    exit(1);
	}
    }

    if (argc > 1 && (optind < argc)) {
	fprintf(stderr, "Bad command line.\n");
	usage();
	exit(1);
    }

    if (dowrite) {

	buffer = malloc(FILESIZE);
	if (buffer == NULL) {
	    fprintf(stderr, "Failed to malloc write buffer.\n");
	    exit(1);
	}

	for (n = 1; n <= FILECOUNT; n++) {
	    sprintf(filename, FILENAME, n);
	    fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
	    if (fd < 0) {
		printf("create failed(%s): %s.\n", filename, strerror(errno));
		exit(1);
	    }
	    memset(buffer, n, FILESIZE);
	    printf("Writing file %s.\n", filename);
	    if (write(fd, buffer, FILESIZE) != FILESIZE) {
		printf("write failed (%s)\n", filename);
	    }

	    close(fd);
	    fd = -1;
	}

	free(buffer);
	buffer = NULL;

	printf("done\n");
	exit(0);
    }

    printf("Using %d workers.\n", workers);

    worker = malloc(workers * sizeof(worker_t));
    if (worker == NULL) {
	fprintf(stderr, "Failed to malloc worker array.\n");
	exit(1);
    }

    for (j = 0; j < workers; j++) {
	worker[j].worker_number = j;
    }

    printf("Using alignment %d.\n", align);
    
    rc = posix_memalign((void **)&buffer, PAGE_SIZE, READSIZE + align);
    if (rc != 0) {
	fprintf(stderr, "posix_memalign failed: %s.\n", strerror(rc));
	exit(1);
    }
    printf("Read buffer: %p.\n", buffer);
    for (n = 1; n <= FILECOUNT; n++) {

	sprintf(filename, FILENAME, n);
	for (j = 0; j < workers; j++) {
	    if ((worker[j].fd = open(filename,  O_RDONLY|O_DIRECT)) < 0) {
		fprintf(stderr, "Failed to open %s: %s.\n",
				filename, strerror(errno));
		exit(1);
	    }

	    worker[j].pattern = n;
	}

	printf("Reading file %d.\n", n);

	for (offset = 0; offset < FILESIZE; offset += READSIZE) {
	    memset(buffer, PATTERN, READSIZE + align);
	    for (j = 0; j < workers; j++) {
		worker[j].offset = offset + j * PAGE_SIZE;
		worker[j].buffer = buffer + align + j * PAGE_SIZE;
		worker[j].length = PAGE_SIZE;
	    }
	    /* The final worker reads whatever is left over. */
	    worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

	    done = 0;

	    rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Can't create fork thread: %s.\n", 
				strerror(rc));
		exit(1);
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_create(&worker[j].tid, 
				    NULL, 
				    worker_thread, 
				    worker + j);
		if (rc != 0) {
		    fprintf(stderr, "Can't create worker thread %d: %s.\n", 
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_join(worker[j].tid, NULL);
		if (rc != 0) {
		    fprintf(stderr, "Failed to join worker thread %d: %s.\n",
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    /* Let the fork thread know it's ok to exit */
	    done = 1;

	    rc = pthread_join(fork_tid, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Failed to join fork thread: %s.\n",
				strerror(rc));
		exit(1);
	    }
	}

	/* Close the fd's for the next file. */
	for (j = 0; j < workers; j++) {
	    close(worker[j].fd);
	}
    }

    return 0;
}


* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-28 21:33 open(2) says O_DIRECT works on 512 byte boundaries? Greg KH
  2009-01-29  0:41 ` Robert Hancock
@ 2009-01-29  5:13 ` KAMEZAWA Hiroyuki
  2009-01-29  7:10   ` KOSAKI Motohiro
  1 sibling, 1 reply; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-29  5:13 UTC (permalink / raw)
  To: Greg KH; +Cc: mtk.manpages, linux-man, linux-kernel

On Wed, 28 Jan 2009 13:33:22 -0800
Greg KH <greg@kroah.com> wrote:

> In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
> with the 2.6 kernel release:
> 	Under Linux 2.4, transfer sizes, and the alignment of the user
> 	buffer and  the file offset must all be multiples of the logical
> 	block size of the file system.  Under Linux 2.6, alignment  to
> 	512-byte  boundaries suffices.
> 
> However if you try to access an O_DIRECT opened file with a buffer that
> is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
> read.)
> 

IIUC, it's not related to the 512-byte boundary. It's just a race between
direct I/O and copy-on-write: a copy-on-write hitting a page while it is
being read into via DIO is the problem.

It's probably true that if the buffer is page-aligned, no copy-on-write
will happen in a usual program. But with HugeTLB pages, which also do
copy-on-write, the data corruption would happen again, and requiring a
HugeTLB-aligned buffer would be nonsense.

Thanks,
-Kame


> Is this just a mistake in the documentation?  Or am I reading it
> incorrectly?
> 
> I have a test program that shows this if anyone wants it.
> 
> thanks,
> 
> greg k-h



* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-29  5:13 ` KAMEZAWA Hiroyuki
@ 2009-01-29  7:10   ` KOSAKI Motohiro
  2009-01-30  6:17     ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: KOSAKI Motohiro @ 2009-01-29  7:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Greg KH, mtk.manpages, linux-man, linux-kernel,
	Andrea Arcangeli

(CC to andrea)

> On Wed, 28 Jan 2009 13:33:22 -0800
> Greg KH <greg@kroah.com> wrote:
> 
> > In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
> > with the 2.6 kernel release:
> > 	Under Linux 2.4, transfer sizes, and the alignment of the user
> > 	buffer and  the file offset must all be multiples of the logical
> > 	block size of the file system.  Under Linux 2.6, alignment  to
> > 	512-byte  boundaries suffices.
> > 
> > However if you try to access an O_DIRECT opened file with a buffer that
> > is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
> > read.)
> > 
> 
> IIUC, it's not related to the 512-byte boundary. It's just a race between
> direct I/O and copy-on-write: a copy-on-write hitting a page while it is
> being read into via DIO is the problem.

Yes.
Greg's reproducer is a bit misleading.

>	    for (j = 0; j < workers; j++) {
>		worker[j].offset = offset + j * PAGE_SIZE;
>		worker[j].buffer = buffer + align + j * PAGE_SIZE;
>		worker[j].length = PAGE_SIZE;
>	    }

This code means:
  - if align == 0, each reader thread touches only one page,
    and each page is touched by only one thread.
  - if align != 0, each reader thread touches two pages,
    and some pages are touched by two threads.

So the race happens only if align != 0.
We discussed this issue with Andrea last month
(in the "Corruption with O_DIRECT and unaligned user buffers" thread).

As far as I know, he is working on fixing this issue now.


> 
> It's probably true that if the buffer is page-aligned, no copy-on-write
> will happen in a usual program. But with HugeTLB pages, which also do
> copy-on-write, the data corruption would happen again, and requiring a
> HugeTLB-aligned buffer would be nonsense.







* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-29  3:13       ` Greg KH
@ 2009-01-29 15:40         ` Jeff Moyer
  2009-01-30  6:16           ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Moyer @ 2009-01-29 15:40 UTC (permalink / raw)
  To: Greg KH; +Cc: mtk.manpages, Robert Hancock, linux-man, linux-kernel

Greg KH <greg@kroah.com> writes:

> On Thu, Jan 29, 2009 at 03:59:12PM +1300, Michael Kerrisk wrote:
>> On Thu, Jan 29, 2009 at 2:17 PM, Greg KH <greg@kroah.com> wrote:
>> >
>> >
>> >
>> > On Wed, Jan 28, 2009 at 06:41:49PM -0600, Robert Hancock wrote:
>> >>
>> >>
>> >> Greg KH wrote:
>> >>> In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
>> >>> with the 2.6 kernel release:
>> >>>      Under Linux 2.4, transfer sizes, and the alignment of the user
>> >>>      buffer and  the file offset must all be multiples of the logical
>> >>>      block size of the file system.  Under Linux 2.6, alignment  to
>> >>>      512-byte  boundaries suffices.
>> >>> However if you try to access an O_DIRECT opened file with a buffer that
>> >>> is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
>> >>> read.)
>> >>> Is this just a mistake in the documentation?  Or am I reading it
>> >>> incorrectly?
>> >>> I have a test program that shows this if anyone wants it.
>> >>
>> >> Well, it sounds like a bug to me.. even if it's not supported, if you do
>> >> such an access, surely the kernel should detect that and return EINVAL or
>> >> something rather than reading corrupted data..
>> >
>> > It doesn't.  It says the read is successful, yet the data is not really
>> > read into the buffer.  Portions of it are, but not the amount we asked
>> > for.
>> 
>> Greg,
>> 
>> Can you post your test program?
>
> Sure, here it is.  I'm still not quite sure it is valid, but at first
> glance it seems to be.
>
> Run it once with no arguments and all of the files will be created.
> Then run it again with no offset being asked for:
> 	./dma_thread -a 0
> then with an offset:
> 	./dma_thread -a 512
>
> The second one breaks.

There are several folks working on this.  See "Corruption with O_DIRECT
and unaligned user buffers" on the linux-fsdevel list.  There is also a
Red Hat bugzilla for this (471613) that several folks have been working
through.

Cheers,
Jeff


* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-29 15:40         ` Jeff Moyer
@ 2009-01-30  6:16           ` Greg KH
  0 siblings, 0 replies; 25+ messages in thread
From: Greg KH @ 2009-01-30  6:16 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: mtk.manpages, Robert Hancock, linux-man, linux-kernel

On Thu, Jan 29, 2009 at 10:40:22AM -0500, Jeff Moyer wrote:
> Greg KH <greg@kroah.com> writes:
> 
> > On Thu, Jan 29, 2009 at 03:59:12PM +1300, Michael Kerrisk wrote:
> >> On Thu, Jan 29, 2009 at 2:17 PM, Greg KH <greg@kroah.com> wrote:
> >> >
> >> >
> >> >
> >> > On Wed, Jan 28, 2009 at 06:41:49PM -0600, Robert Hancock wrote:
> >> >>
> >> >>
> >> >> Greg KH wrote:
> >> >>> In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
> >> >>> with the 2.6 kernel release:
> >> >>>      Under Linux 2.4, transfer sizes, and the alignment of the user
> >> >>>      buffer and  the file offset must all be multiples of the logical
> >> >>>      block size of the file system.  Under Linux 2.6, alignment  to
> >> >>>      512-byte  boundaries suffices.
> >> >>> However if you try to access an O_DIRECT opened file with a buffer that
> >> >>> is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
> >> >>> read.)
> >> >>> Is this just a mistake in the documentation?  Or am I reading it
> >> >>> incorrectly?
> >> >>> I have a test program that shows this if anyone wants it.
> >> >>
> >> >> Well, it sounds like a bug to me.. even if it's not supported, if you do
> >> >> such an access, surely the kernel should detect that and return EINVAL or
> >> >> something rather than reading corrupted data..
> >> >
> >> > It doesn't.  It says the read is successful, yet the data is not really
> >> > read into the buffer.  Portions of it are, but not the amount we asked
> >> > for.
> >> 
> >> Greg,
> >> 
> >> Can you post your test program?
> >
> > Sure, here it is.  I'm still not quite sure it is valid, but at first
> > glance it seems to be.
> >
> > Run it once with no arguments and all of the files will be created.
> > Then run it again with no offset being asked for:
> > 	./dma_thread -a 0
> > then with an offset:
> > 	./dma_thread -a 512
> >
> > The second one breaks.
> 
> There are several folks working on this.  See "Corruption with O_DIRECT
> and unaligned user buffers" on the linux-fsdevel list.  There is also a
> Red Hat bugzilla for this (471613) that several folks have been working
> through.

Thanks for the pointers to that, I'll follow along with it.

greg k-h


* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-29  7:10   ` KOSAKI Motohiro
@ 2009-01-30  6:17     ` Greg KH
  2009-02-02 22:08       ` Andrea Arcangeli
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2009-01-30  6:17 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, mtk.manpages, linux-man, linux-kernel,
	Andrea Arcangeli

On Thu, Jan 29, 2009 at 04:10:39PM +0900, KOSAKI Motohiro wrote:
> (CC to andrea)
> 
> > On Wed, 28 Jan 2009 13:33:22 -0800
> > Greg KH <greg@kroah.com> wrote:
> > 
> > > In looking at open(2), it says that O_DIRECT works on 512 byte boundaries
> > > with the 2.6 kernel release:
> > > 	Under Linux 2.4, transfer sizes, and the alignment of the user
> > > 	buffer and  the file offset must all be multiples of the logical
> > > 	block size of the file system.  Under Linux 2.6, alignment  to
> > > 	512-byte  boundaries suffices.
> > > 
> > > However if you try to access an O_DIRECT opened file with a buffer that
> > > is PAGE_SIZE aligned + 512 bytes, it fails in a bad way (wrong data is
> > > read.)
> > > 
> > 
> > IIUC, it's not related to the 512-byte boundary. It's just a race between
> > direct I/O and copy-on-write: a copy-on-write hitting a page while it is
> > being read into via DIO is the problem.
> 
> Yes.
> Greg's reproducer is a bit misleading.
> 
> >	    for (j = 0; j < workers; j++) {
> >		worker[j].offset = offset + j * PAGE_SIZE;
> >		worker[j].buffer = buffer + align + j * PAGE_SIZE;
> >		worker[j].length = PAGE_SIZE;
> >	    }
> 
> This code means:
>   - if align == 0, each reader thread touches only one page,
>     and each page is touched by only one thread.
>   - if align != 0, each reader thread touches two pages,
>     and some pages are touched by two threads.
> 
> So the race happens only if align != 0.
> We discussed this issue with Andrea last month
> (in the "Corruption with O_DIRECT and unaligned user buffers" thread).
> 
> As far as I know, he is working on fixing this issue now.

Thanks for the pointers, I'll go read the thread and follow up there.

greg k-h


* Re: open(2) says O_DIRECT works on 512 byte boundaries?
  2009-01-30  6:17     ` Greg KH
@ 2009-02-02 22:08       ` Andrea Arcangeli
  2009-02-03  1:29         ` KAMEZAWA Hiroyuki
                           ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-02 22:08 UTC (permalink / raw)
  To: Greg KH
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages, linux-man,
	linux-kernel

Hi Greg!

> Thanks for the pointers, I'll go read the thread and follow up there.

If you also run into this, the final fix is attached below. Porting it
to mainline is a bit hard because of gup-fast... Perhaps we can use mmu
notifiers to fix gup-fast; I need to think more about it, then I'll
post something.

Please help testing the below on pre-gup-fast kernels, thanks!

From: Andrea Arcangeli <aarcange@redhat.com>
Subject: fork-o_direct-race

Think of a thread writing constantly to the last 512 bytes of a page, while
another thread reads and writes to/from the first 512 bytes of that page. We can
lose O_DIRECT reads (or any other get_user_pages write=1 I/O, not just
bio/O_DIRECT) the very moment we mark any pte wrprotected, because a third,
unrelated thread forks off a child.

This fixes it by never wrprotecting anon ptes if there can be any direct I/O in
flight to the page, and by instantiating a readonly pte and triggering a COW in
the child instead. The only trouble here is O_DIRECT reads (writes to memory,
reads from disk). Checking the page_count under the PT lock guarantees no
get_user_pages could be running under us, because anybody who wants to write to
the page has to break any cow first, and that requires taking the PT lock in
follow_page before increasing the page count. We are guaranteed mapcount is 1 if
fork is writeprotecting the pte, so the PT lock is enough to serialize against
get_user_pages->get_page.

The COW triggered inside fork will run while the parent pte is readonly to
provide as usual the per-page atomic copy from parent to child during fork.
However timings will be altered by having to copy the pages that might be under
O_DIRECT.

The pagevec code calls get_page while the page is sitting in the pagevec
(before it becomes PageLRU), and in doing so it can generate false positives.
So, to avoid slowing down fork all the time, even for pages that could never
possibly be under O_DIRECT write=1, the PG_gup bitflag is added; this
eliminates most of the overhead of the fix in fork.

Patch doesn't break kABI despite introducing a new page flag.

Fixed version of original patch from Nick Piggin.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff -ur 2/include/linux/page-flags.h 1/include/linux/page-flags.h
--- 2/include/linux/page-flags.h	2008-07-10 17:26:44.000000000 +0200
+++ 1/include/linux/page-flags.h	2009-02-02 05:33:42.000000000 +0100
@@ -86,6 +86,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_gup			20	/* Page pin may be because of gup */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked              PG_owner_priv_1 /* Used by some filesystems */
@@ -238,6 +239,10 @@
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
 
+#define SetPageGUP(page)	set_bit(PG_gup, &(page)->flags)
+#define PageGUP(page)		test_bit(PG_gup, &(page)->flags)
+#define __ClearPageGUP(page)	__clear_bit(PG_gup, &(page)->flags)
+
 /*
  * PG_reclaim is used in combination with PG_compound to mark the
  * head and tail of a compound page
diff -ur 2/kernel/fork.c 1/kernel/fork.c
--- 2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
+++ 1/kernel/fork.c	2009-02-02 05:24:17.000000000 +0100
@@ -368,7 +368,7 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff -ur 2/mm/memory.c 1/mm/memory.c
--- 2/mm/memory.c	2008-07-10 17:26:44.000000000 +0200
+++ 1/mm/memory.c	2009-02-02 06:17:05.000000000 +0100
@@ -426,7 +426,7 @@
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -434,6 +434,7 @@
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int forcecow = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -464,15 +465,6 @@
 	}
 
 	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = *src_pte;
-	}
-
-	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
 	 */
@@ -484,13 +476,41 @@
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
+		if (is_cow_mapping(vm_flags) && PageAnon(page) &&
+		    PageGUP(page)) {
+			if (unlikely(TestSetPageLocked(page)))
+				forcecow = 1;
+			else {
+				if (unlikely(page_count(page) !=
+					     page_mapcount(page)
+					     + !!PageSwapCache(page)))
+					forcecow = 1;
+				unlock_page(page);
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}
 
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child.
+	 */
+	if (is_cow_mapping(vm_flags)) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
+	}
+
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return forcecow;
 }
 
+static void fork_pre_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			 unsigned long address,
+			 pte_t *src_pte, pte_t *dst_pte,
+			 struct page *pre_cow_page);
+
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
@@ -499,17 +519,30 @@
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;
+	struct page *pre_cow_page = NULL;
 
 again:
+	if (!pre_cow_page) {
+		pre_cow_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		if (!pre_cow_page)
+			return -ENOMEM;
+	}
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
+	if (!dst_pte) {
+		page_cache_release(pre_cow_page);
 		return -ENOMEM;
+	}
 	src_pte = pte_offset_map_nested(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -525,10 +558,26 @@
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(forcecow)) {
+		/*
+		 * Try to COW the child page as direct I/O is working
+		 * on the parent page, and so we've to mark the parent
+		 * pte read-write before dropping the PT lock to avoid
+		 * the page to be cowed in the parent and any direct
+		 * I/O to get lost.
+		 */
+		fork_pre_cow(dst_mm, vma, addr-PAGE_SIZE,
+			     src_pte-1, dst_pte-1, pre_cow_page);
+		/* after the page copy set the parent pte writeable */
+		set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1,
+			   pte_mkwrite(*(src_pte-1)));
+		pre_cow_page = NULL;
+	}
+
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
@@ -536,6 +585,8 @@
 	cond_resched();
 	if (addr != end)
 		goto again;
+	if (pre_cow_page)
+		page_cache_release(pre_cow_page);
 	return 0;
 }
 
@@ -956,8 +1007,11 @@
 	if (unlikely(!page))
 		goto unlock;
 
-	if (flags & FOLL_GET)
+	if (flags & FOLL_GET) {
 		get_page(page);
+		if ((flags & FOLL_WRITE) && PageAnon(page) && !PageGUP(page))
+			SetPageGUP(page);
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1607,6 +1661,30 @@
 	copy_user_highpage(dst, src, va);
 }
 
+static void fork_pre_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			 unsigned long address,
+			 pte_t *src_pte, pte_t *dst_pte,
+			 struct page *pre_cow_page)
+{
+	pte_t entry;
+	struct page *old_page;
+
+	old_page = vm_normal_page(vma, address, *src_pte);
+	BUG_ON(!old_page);
+	cow_user_page(pre_cow_page, old_page, address);
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+
+	flush_cache_page(vma, address, pte_pfn(*src_pte));
+	entry = mk_pte(pre_cow_page, vma->vm_page_prot);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	page_add_new_anon_rmap(pre_cow_page, vma, address);
+	lru_cache_add_active(pre_cow_page);
+	set_pte_at(mm, address, dst_pte, entry);
+	update_mmu_cache(vma, address, entry);
+	lazy_mmu_prot_update(entry);
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
diff -ur 2/mm/page_alloc.c 1/mm/page_alloc.c
--- 2/mm/page_alloc.c	2008-07-10 17:26:44.000000000 +0200
+++ 1/mm/page_alloc.c	2009-02-02 05:33:18.000000000 +0100
@@ -154,6 +154,7 @@
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_gup |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -400,6 +401,8 @@
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageGUP(page))
+		__ClearPageGUP(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -546,6 +549,7 @@
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_gup |
 			1 << PG_buddy ))))
 		bad_page(page);
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-02 22:08       ` Andrea Arcangeli
@ 2009-02-03  1:29         ` KAMEZAWA Hiroyuki
  2009-02-03  2:31           ` Andrea Arcangeli
  2009-02-03  3:50         ` Greg KH
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-03  1:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Mon, 2 Feb 2009 23:08:56 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Hi Greg!
> 
> > Thanks for the pointers, I'll go read the thread and follow up there.
> 
> If you also run into this final fix is attached below. Porting to
> mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> notifiers to fix gup-fast... need to think more about it then I'll
> post something.
> 
> Please help testing the below on pre-gup-fast kernels, thanks!
> 
I commented on this in the FJ-Redhat path, but for some unknown reason it
was not forwarded ;) so I will comment again.

1. Why is TestSetPageLocked() necessary ?
   It seems unnecessary.

2. This patch doesn't cover HugeTLB.

3. Why only the "follow_page() successfully finds a page" case ?
 Isn't it necessary to insert SetPageGUP() in the following paths too ?

 - handle_mm_fault()
           => do_anonymous/swap/wp_page()
           or similar.

BTW, when you write a patch against upstream, please CC me or linux-mm.
I'll have to add a hook for memory-cgroup.

Thanks,
-Kame


> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: fork-o_direct-race
> 
> Think a thread writing constantly to the last 512bytes of a page, while another
> thread read and writes to/from the first 512bytes of the page. We can lose
> O_DIRECT reads (or any other get_user_pages write=1 I/O not just bio/O_DIRECT),
> the very moment we mark any pte wrprotected because a third unrelated thread
> forks off a child.
> 
> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
> flight to the page, and by instantiating a readonly pte and triggering a COW in
> the child. The only trouble here are O_DIRECT reads (writes to memory, read
> from disk). Checking the page_count under the PT lock guarantees no
> get_user_pages could be running under us because if somebody wants to write to
> the page, it has to break any cow first and that requires taking the PT lock in
> follow_page before increasing the page count. We are guaranteed mapcount is 1 if
> fork is writeprotecting the pte so the PT lock is enough to serialize against
> get_user_pages->get_page.
> 
> The COW triggered inside fork will run while the parent pte is readonly to
> provide as usual the per-page atomic copy from parent to child during fork.
> However timings will be altered by having to copy the pages that might be under
> O_DIRECT.
> 
> The pagevec code calls get_page while the page is sitting in the pagevec
> (before it becomes PageLRU) and doing so it can generate false positives, so to
> avoid slowing down fork all the time even for pages that could never possibly
> be under O_DIRECT write=1, the PG_gup bitflag is added, this eliminates
> most overhead of the fix in fork.
> 
> Patch doesn't break kABI despite introducing a new page flag.
> 
> Fixed version of original patch from Nick Piggin.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff -ur 2/include/linux/page-flags.h 1/include/linux/page-flags.h
> --- 2/include/linux/page-flags.h	2008-07-10 17:26:44.000000000 +0200
> +++ 1/include/linux/page-flags.h	2009-02-02 05:33:42.000000000 +0100
> @@ -86,6 +86,7 @@
>  #define PG_reclaim		17	/* To be reclaimed asap */
>  #define PG_nosave_free		18	/* Free, should not be written */
>  #define PG_buddy		19	/* Page is free, on buddy lists */
> +#define PG_gup			20	/* Page pin may be because of gup */
>  
>  /* PG_owner_priv_1 users should have descriptive aliases */
>  #define PG_checked              PG_owner_priv_1 /* Used by some filesystems */
> @@ -238,6 +239,10 @@
>  #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
>  #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
>  
> +#define SetPageGUP(page)	set_bit(PG_gup, &(page)->flags)
> +#define PageGUP(page)		test_bit(PG_gup, &(page)->flags)
> +#define __ClearPageGUP(page)	__clear_bit(PG_gup, &(page)->flags)
> +
>  /*
>   * PG_reclaim is used in combination with PG_compound to mark the
>   * head and tail of a compound page
> diff -ur 2/kernel/fork.c 1/kernel/fork.c
> --- 2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> +++ 1/kernel/fork.c	2009-02-02 05:24:17.000000000 +0100
> @@ -368,7 +368,7 @@
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, oldmm, mpnt);
> +		retval = copy_page_range(mm, oldmm, tmp);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> Only in 1: ldtest7557.out
> diff -ur 2/mm/memory.c 1/mm/memory.c
> --- 2/mm/memory.c	2008-07-10 17:26:44.000000000 +0200
> +++ 1/mm/memory.c	2009-02-02 06:17:05.000000000 +0100
> @@ -426,7 +426,7 @@
>   * covered by this vma.
>   */
>  
> -static inline void
> +static inline int
>  copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
>  		unsigned long addr, int *rss)
> @@ -434,6 +434,7 @@
>  	unsigned long vm_flags = vma->vm_flags;
>  	pte_t pte = *src_pte;
>  	struct page *page;
> +	int forcecow = 0;
>  
>  	/* pte contains position in swap or file, so copy. */
>  	if (unlikely(!pte_present(pte))) {
> @@ -464,15 +465,6 @@
>  	}
>  
>  	/*
> -	 * If it's a COW mapping, write protect it both
> -	 * in the parent and the child
> -	 */
> -	if (is_cow_mapping(vm_flags)) {
> -		ptep_set_wrprotect(src_mm, addr, src_pte);
> -		pte = *src_pte;
> -	}
> -
> -	/*
>  	 * If it's a shared mapping, mark it clean in
>  	 * the child
>  	 */
> @@ -484,13 +476,41 @@
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page);
> +		if (is_cow_mapping(vm_flags) && PageAnon(page) &&
> +		    PageGUP(page)) {
> +			if (unlikely(TestSetPageLocked(page)))
> +				forcecow = 1;
> +			else {
> +				if (unlikely(page_count(page) !=
> +					     page_mapcount(page)
> +					     + !!PageSwapCache(page)))
> +					forcecow = 1;
> +				unlock_page(page);
> +			}
> +		}
>  		rss[!!PageAnon(page)]++;
>  	}
>  
> +	/*
> +	 * If it's a COW mapping, write protect it both
> +	 * in the parent and the child.
> +	 */
> +	if (is_cow_mapping(vm_flags)) {
> +		ptep_set_wrprotect(src_mm, addr, src_pte);
> +		pte = pte_wrprotect(pte);
> +	}
> +
>  out_set_pte:
>  	set_pte_at(dst_mm, addr, dst_pte, pte);
> +
> +	return forcecow;
>  }
>  
> +static void fork_pre_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> +			 unsigned long address,
> +			 pte_t *src_pte, pte_t *dst_pte,
> +			 struct page *pre_cow_page);
> +
>  static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
>  		unsigned long addr, unsigned long end)
> @@ -499,17 +519,30 @@
>  	spinlock_t *src_ptl, *dst_ptl;
>  	int progress = 0;
>  	int rss[2];
> +	int forcecow;
> +	struct page *pre_cow_page = NULL;
>  
>  again:
> +	if (!pre_cow_page) {
> +		pre_cow_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> +		if (!pre_cow_page)
> +			return -ENOMEM;
> +	}
> +	forcecow = 0;
>  	rss[1] = rss[0] = 0;
>  	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> -	if (!dst_pte)
> +	if (!dst_pte) {
> +		page_cache_release(pre_cow_page);
>  		return -ENOMEM;
> +	}
>  	src_pte = pte_offset_map_nested(src_pmd, addr);
>  	src_ptl = pte_lockptr(src_mm, src_pmd);
>  	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>  
>  	do {
> +		if (forcecow)
> +			break;
> +
>  		/*
>  		 * We are holding two locks at this point - either of them
>  		 * could generate latencies in another task on another CPU.
> @@ -525,10 +558,26 @@
>  			progress++;
>  			continue;
>  		}
> -		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
> +		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
>  		progress += 8;
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(forcecow)) {
> +		/*
> +		 * COW the child page now, since direct I/O may be in
> +		 * flight against the parent page; we must make the
> +		 * parent pte writable again before dropping the PT lock,
> +		 * so that the page is never cowed in the parent and no
> +		 * direct I/O gets lost.
> +		 */
> +		fork_pre_cow(dst_mm, vma, addr-PAGE_SIZE,
> +			     src_pte-1, dst_pte-1, pre_cow_page);
> +		/* after the page copy set the parent pte writeable */
> +		set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1,
> +			   pte_mkwrite(*(src_pte-1)));
> +		pre_cow_page = NULL;
> +	}
> +
>  	spin_unlock(src_ptl);
>  	pte_unmap_nested(src_pte - 1);
>  	add_mm_rss(dst_mm, rss[0], rss[1]);
> @@ -536,6 +585,8 @@
>  	cond_resched();
>  	if (addr != end)
>  		goto again;
> +	if (pre_cow_page)
> +		page_cache_release(pre_cow_page);
>  	return 0;
>  }
>  
> @@ -956,8 +1007,11 @@
>  	if (unlikely(!page))
>  		goto unlock;
>  
> -	if (flags & FOLL_GET)
> +	if (flags & FOLL_GET) {
>  		get_page(page);
> +		if ((flags & FOLL_WRITE) && PageAnon(page) && !PageGUP(page))
> +			SetPageGUP(page);
> +	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
>  		    !pte_dirty(pte) && !PageDirty(page))
> @@ -1607,6 +1661,30 @@
>  	copy_user_highpage(dst, src, va);
>  }
>  
> +static void fork_pre_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> +			 unsigned long address,
> +			 pte_t *src_pte, pte_t *dst_pte,
> +			 struct page *pre_cow_page)
> +{
> +	pte_t entry;
> +	struct page *old_page;
> +
> +	old_page = vm_normal_page(vma, address, *src_pte);
> +	BUG_ON(!old_page);
> +	cow_user_page(pre_cow_page, old_page, address);
> +	page_remove_rmap(old_page);
> +	page_cache_release(old_page);
> +
> +	flush_cache_page(vma, address, pte_pfn(*src_pte));
> +	entry = mk_pte(pre_cow_page, vma->vm_page_prot);
> +	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	page_add_new_anon_rmap(pre_cow_page, vma, address);
> +	lru_cache_add_active(pre_cow_page);
> +	set_pte_at(mm, address, dst_pte, entry);
> +	update_mmu_cache(vma, address, entry);
> +	lazy_mmu_prot_update(entry);
> +}
> +
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> diff -ur 2/mm/page_alloc.c 1/mm/page_alloc.c
> --- 2/mm/page_alloc.c	2008-07-10 17:26:44.000000000 +0200
> +++ 1/mm/page_alloc.c	2009-02-02 05:33:18.000000000 +0100
> @@ -154,6 +154,7 @@
>  			1 << PG_slab    |
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
> +			1 << PG_gup |
>  			1 << PG_buddy );
>  	set_page_count(page, 0);
>  	reset_page_mapcount(page);
> @@ -400,6 +401,8 @@
>  		bad_page(page);
>  	if (PageDirty(page))
>  		__ClearPageDirty(page);
> +	if (PageGUP(page))
> +		__ClearPageGUP(page);
>  	/*
>  	 * For now, we report if PG_reserved was found set, but do not
>  	 * clear it, and do not free the page.  But we shall soon need
> @@ -546,6 +549,7 @@
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
>  			1 << PG_reserved |
> +			1 << PG_gup |
>  			1 << PG_buddy ))))
>  		bad_page(page);
>  
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-03  1:29         ` KAMEZAWA Hiroyuki
@ 2009-02-03  2:31           ` Andrea Arcangeli
  2009-02-03  2:55             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-03  2:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Tue, Feb 03, 2009 at 10:29:20AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 2 Feb 2009 23:08:56 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > Hi Greg!
> > 
> > > Thanks for the pointers, I'll go read the thread and follow up there.
> > 
> > If you also run into this final fix is attached below. Porting to
> > mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> > notifiers to fix gup-fast... need to think more about it then I'll
> > post something.
> > 
> > Please help testing the below on pre-gup-fast kernels, thanks!
> > 
> I commented in FJ-Redhat Path but not forwared from unknown reason ;)
> I comment again.
> 
> 1. Why TestSetLockPage() is necessary ?
>    It seems not necesary.

To prevent the VM from removing the page from (or adding it to) the
swapcache and changing page_count/mapcount under us. This most
certainly wasn't the reason for the slowdown (the slowdown was the
false positives generated by pagevec pinning), and removing it was
more intrusive than I wanted.

> 2. This patch doesn't cover HugeTLB.

There's no need to change hugetlb with my approach. I'm not touching
the cow path; I'm addressing the real source of the problem. As with
ksm, when fork marks the child pte readonly while leaving it pointing
at the shared parent page, the pte wrprotect + tlb flush stops the
_cpu_ but cannot stop any get_user_pages(write=1) user. Hence we need
to pre-cow the child page in fork, instead of marking the child pte
readonly, so the parent doesn't lose writes if the parent cows
post-fork while the child doesn't.

> 3. Why "follow_page() successfully finds a page" case only ?
>  not necessary to insert SetPageGUP() in following path ?
> 
>  - handle_mm_fault()
>            => do_anonymos/swap/wp_page()
>            or some.

No need to change that either; all we need to identify are the pages
whose count vs mapcount discrepancy could have been caused by
get_user_pages, so only follow_page has to set it. More precisely,
FOLL_GET|FOLL_WRITE is the only path we care about there.

> BTW, when you write a patch against upstream, please CC me or linux-mm.
> I'll have to add a hook for memory-cgroup.

Sure.

BTW, although I couldn't reproduce the problem here after leaving the
./dma_thread -a 512 -w 40 workload running for half a day, others
reported trouble to me, though on a different kernel codebase; at this
point I'm unsure whether any remaining trouble is caused by some
imperfection in this patch or by something else. Test results would be
interesting, basically. The patch is against rhel-5.2 but should be
trivial to apply to anything pre-get_user_pages_fast.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-03  2:31           ` Andrea Arcangeli
@ 2009-02-03  2:55             ` KAMEZAWA Hiroyuki
  2009-02-03  3:42               ` KAMEZAWA Hiroyuki
  2009-02-06 17:55               ` Andrea Arcangeli
  0 siblings, 2 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-03  2:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Tue, 3 Feb 2009 03:31:47 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Tue, Feb 03, 2009 at 10:29:20AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 2 Feb 2009 23:08:56 +0100
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > > Hi Greg!
> > > 
> > > > Thanks for the pointers, I'll go read the thread and follow up there.
> > > 
> > > If you also run into this final fix is attached below. Porting to
> > > mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> > > notifiers to fix gup-fast... need to think more about it then I'll
> > > post something.
> > > 
> > > Please help testing the below on pre-gup-fast kernels, thanks!
> > > 
> > I commented in FJ-Redhat Path but not forwared from unknown reason ;)
> > I comment again.
> > 
> > 1. Why TestSetLockPage() is necessary ?
> >    It seems not necesary.
> 
> To avoid the VM to remove or add the page from/to swapcache and change
> page_count/mapcount from under us. This most certainly wasn't the
> reason of the slowdown (the slowdown were the false positives
> generated by pagevec pinning) and removing it was more intrusive than
> I wanted.

My point is:
  - If TestSetPageLocked() fails, forcecow = 1.
  - If the count/mapcount check fails, forcecow = 1.

So the trylock here seems meaningless. If you consider locking the page
important, just using lock_page() seems better.

> 
> > 2. This patch doesn't cover HugeTLB.
> 
> There's no need to change hugetlb with my approach. I'm not touching
> the cow path, I'm addressing the real source of the problem (i.e. when
> fork pretends to mark the child pte readonly and pointing to the
> shared parent page, same as ksm: while the pte wrprotect + tlb flush
> stops the _cpu_ it can't stop any get_user_pages(write=1) user, hence
> we need to pre-cow the child page in fork instead of marking the child
> pte readonly to avoid the parent to lose writes if post-fork the
> parent cows and the child doesn't cow).
> 
Is no patch needed for copy_hugetlb_page_range() ?
IMHO, HugeTLB pages can still be write-protected at fork().

> > 3. Why "follow_page() successfully finds a page" case only ?
> >  not necessary to insert SetPageGUP() in following path ?
> > 
> >  - handle_mm_fault()
> >            => do_anonymos/swap/wp_page()
> >            or some.
> 
> No need to change that either, all we need to know are the pages whose
> count vs mapcount has a discrepancy that could have been caused by
> get_user_pages. So only follow_page has to set it. More precisely
> FOLL_GET|FOLL_WRITE is the only path we care about there.
> 

Assume 3 threads in a process.
==
 Thread1 (DIO-Read)                        Thread2           Thread3
 get_user_page()
 => handle_mm_fault().
    => map a page with no-write-protect.
                                            fork()
                                      (write-protect here)
                                                              Copy-On-Write
 endio.

pre-cow-at-fork will never happen because PageGUP is not set.
After the READ completes, this process will see a broken page.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-03  2:55             ` KAMEZAWA Hiroyuki
@ 2009-02-03  3:42               ` KAMEZAWA Hiroyuki
  2009-02-06 17:55               ` Andrea Arcangeli
  1 sibling, 0 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-03  3:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Greg KH, KOSAKI Motohiro, mtk.manpages,
	linux-man, linux-kernel

On Tue, 3 Feb 2009 11:55:40 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 3 Feb 2009 03:31:47 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > On Tue, Feb 03, 2009 at 10:29:20AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 2 Feb 2009 23:08:56 +0100
> > > Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > 
> > > > Hi Greg!
> > > > 
> > > > > Thanks for the pointers, I'll go read the thread and follow up there.
> > > > 
> > > > If you also run into this final fix is attached below. Porting to
> > > > mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> > > > notifiers to fix gup-fast... need to think more about it then I'll
> > > > post something.
> > > > 
> > > > Please help testing the below on pre-gup-fast kernels, thanks!
> > > > 
> > > I commented in FJ-Redhat Path but not forwared from unknown reason ;)
> > > I comment again.
> > > 
> > > 1. Why TestSetLockPage() is necessary ?
> > >    It seems not necesary.
> > 
> > To avoid the VM to remove or add the page from/to swapcache and change
> > page_count/mapcount from under us. This most certainly wasn't the
> > reason of the slowdown (the slowdown were the false positives
> > generated by pagevec pinning) and removing it was more intrusive than
> > I wanted.
> 
> My point is.
>   - If TestSetLockPage() failes, force_cow=1.
>   - If count/mapcount check fails, force_cow=1.
> 
> So, lock_page() here seems meaningless. If you consider lock_page() is important,
> just use lock_page() seems better.
> 
I have changed my thinking; I now understand that the *lock* is necessary here.
But I'm not sure it is enough.



> > > 3. Why "follow_page() successfully finds a page" case only ?
> > >  not necessary to insert SetPageGUP() in following path ?
> > > 
> > >  - handle_mm_fault()
> > >            => do_anonymos/swap/wp_page()
> > >            or some.
> > 
> > No need to change that either, all we need to know are the pages whose
> > count vs mapcount has a discrepancy that could have been caused by
> > get_user_pages. So only follow_page has to set it. More precisely
> > FOLL_GET|FOLL_WRITE is the only path we care about there.
> > 
> 
> Assume 3 threads in a process.
please ignore this, I'm ashamed ;(

-Kame


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-02 22:08       ` Andrea Arcangeli
  2009-02-03  1:29         ` KAMEZAWA Hiroyuki
@ 2009-02-03  3:50         ` Greg KH
  2009-02-03 15:01           ` Andrea Arcangeli
  2009-02-03  4:13         ` KAMEZAWA Hiroyuki
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2009-02-03  3:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages, linux-man,
	linux-kernel

On Mon, Feb 02, 2009 at 11:08:56PM +0100, Andrea Arcangeli wrote:
> Hi Greg!
> 
> > Thanks for the pointers, I'll go read the thread and follow up there.
> 
> If you also run into this final fix is attached below. Porting to
> mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> notifiers to fix gup-fast... need to think more about it then I'll
> post something.
> 
> Please help testing the below on pre-gup-fast kernels, thanks!

Thanks a lot for the patch, I'll try this out tomorrow.

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-02 22:08       ` Andrea Arcangeli
  2009-02-03  1:29         ` KAMEZAWA Hiroyuki
  2009-02-03  3:50         ` Greg KH
@ 2009-02-03  4:13         ` KAMEZAWA Hiroyuki
  2009-02-03  4:38         ` KAMEZAWA Hiroyuki
  2009-02-04 23:41         ` Greg KH
  4 siblings, 0 replies; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-03  4:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Mon, 2 Feb 2009 23:08:56 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Hi Greg!
> 
> > Thanks for the pointers, I'll go read the thread and follow up there.
> 
> If you also run into this final fix is attached below. Porting to
> mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> notifiers to fix gup-fast... need to think more about it then I'll
> post something.
> 
> Please help testing the below on pre-gup-fast kernels, thanks!
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: fork-o_direct-race
> 
> Think a thread writing constantly to the last 512bytes of a page, while another
> thread read and writes to/from the first 512bytes of the page. We can lose
> O_DIRECT reads (or any other get_user_pages write=1 I/O not just bio/O_DIRECT),
> the very moment we mark any pte wrprotected because a third unrelated thread
> forks off a child.
> 
> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
> flight to the page, and by instantiating a readonly pte and triggering a COW in
> the child. The only trouble here are O_DIRECT reads (writes to memory, read
> from disk). Checking the page_count under the PT lock guarantees no
> get_user_pages could be running under us because if somebody wants to write to
> the page, it has to break any cow first and that requires taking the PT lock in
> follow_page before increasing the page count. We are guaranteed mapcount is 1 if
> fork is writeprotecting the pte so the PT lock is enough to serialize against
> get_user_pages->get_page.
> 
> The COW triggered inside fork will run while the parent pte is readonly to
> provide as usual the per-page atomic copy from parent to child during fork.
> However timings will be altered by having to copy the pages that might be under
> O_DIRECT.
> 
> The pagevec code calls get_page while the page is sitting in the pagevec
> (before it becomes PageLRU) and doing so it can generate false positives, so to
> avoid slowing down fork all the time even for pages that could never possibly
> be under O_DIRECT write=1, the PG_gup bitflag is added, this eliminates
> most overhead of the fix in fork.
> 
> Patch doesn't break kABI despite introducing a new page flag.
> 
> Fixed version of original patch from Nick Piggin.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
Just curious, could you confirm the following ?

(The code below is from RHEL 5.3.)

It seems page_cache_release() can be called before page_cache_get().
Doesn't this break your assumption ?
(I mean, is the following code buggy ?)
Do we take an extra page reference somewhere before page_cache_release() ?

=
 667 submit_page_section(struct dio *dio, struct page *page,
 668                 unsigned offset, unsigned len, sector_t blocknr)
 669 {
 670         int ret = 0;
 671 
 672         /*
 673          * Can we just grow the current page's presence in the dio?
 674          */
 675         if (    (dio->cur_page == page) &&
 676                 (dio->cur_page_offset + dio->cur_page_len == offset) &&
 677                 (dio->cur_page_block +
 678                         (dio->cur_page_len >> dio->blkbits) == blocknr)) {
 679                 dio->cur_page_len += len;
 680 
 681                 /*
 682                  * If dio->boundary then we want to schedule the IO now to
 683                  * avoid metadata seeks.
 684                  */
 685                 if (dio->boundary) {
 686                         ret = dio_send_cur_page(dio);
 687                         page_cache_release(dio->cur_page);
 688                         dio->cur_page = NULL;
 689                 }
 690                 goto out;
 691         }
 692 
 693         /*
 694          * If there's a deferred page already there then send it.
 695          */
 696         if (dio->cur_page) {
 697                 ret = dio_send_cur_page(dio);
 698                 page_cache_release(dio->cur_page);
 699                 dio->cur_page = NULL;
 700                 if (ret)
 701                         goto out;
 702         }
 703 
 704         page_cache_get(page);           /* It is in dio */
 705         dio->cur_page = page;
 706         dio->cur_page_offset = offset;
 707         dio->cur_page_len = len;
 708         dio->cur_page_block = blocknr;
 709 out:
 710         return ret;
 711 }
==

Regards,
-Kame





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-02 22:08       ` Andrea Arcangeli
                           ` (2 preceding siblings ...)
  2009-02-03  4:13         ` KAMEZAWA Hiroyuki
@ 2009-02-03  4:38         ` KAMEZAWA Hiroyuki
  2009-02-03 15:08           ` Andrea Arcangeli
  2009-02-04 23:41         ` Greg KH
  4 siblings, 1 reply; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-03  4:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Mon, 2 Feb 2009 23:08:56 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Hi Greg!
> 
> > Thanks for the pointers, I'll go read the thread and follow up there.
> 
> If you also run into this final fix is attached below. Porting to
> mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> notifiers to fix gup-fast... need to think more about it then I'll
> post something.
> 
> Please help testing the below on pre-gup-fast kernels, thanks!
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: fork-o_direct-race
> 
> Think a thread writing constantly to the last 512bytes of a page, while another
> thread read and writes to/from the first 512bytes of the page. We can lose
> O_DIRECT reads (or any other get_user_pages write=1 I/O not just bio/O_DIRECT),
> the very moment we mark any pte wrprotected because a third unrelated thread
> forks off a child.
> 
> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
> flight to the page, and by instantiating a readonly pte and triggering a COW in
> the child. The only trouble here are O_DIRECT reads (writes to memory, read
> from disk). Checking the page_count under the PT lock guarantees no
> get_user_pages could be running under us because if somebody wants to write to
> the page, it has to break any cow first and that requires taking the PT lock in
> follow_page before increasing the page count. We are guaranteed mapcount is 1 if
> fork is writeprotecting the pte so the PT lock is enough to serialize against
> get_user_pages->get_page.
> 
> The COW triggered inside fork will run while the parent pte is readonly to
> provide as usual the per-page atomic copy from parent to child during fork.
> However timings will be altered by having to copy the pages that might be under
> O_DIRECT.
> 
> The pagevec code calls get_page while the page is sitting in the pagevec
> (before it becomes PageLRU) and doing so it can generate false positives, so to
> avoid slowing down fork all the time even for pages that could never possibly
> be under O_DIRECT write=1, the PG_gup bitflag is added, this eliminates
> most overhead of the fix in fork.
> 
> Patch doesn't break kABI despite introducing a new page flag.
> 
> Fixed version of original patch from Nick Piggin.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---

Sorry, one more ;)
==
			cond_resched();
			while (!(page = follow_page(vma, start, foll_flags))) {
				int ret;
				ret = __handle_mm_fault(mm, vma, start,
						foll_flags & FOLL_WRITE);
				/*
				 * The VM_FAULT_WRITE bit tells us that do_wp_page has
				 * broken COW when necessary, even if maybe_mkwrite
				 * decided not to set pte_write. We can thus safely do
				 * subsequent page lookups as if they were reads.
				 */
				if (ret & VM_FAULT_WRITE)
					foll_flags &= ~FOLL_WRITE;
==

From the above, FOLL_WRITE can be dropped and PageGUP() will not be set?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-03  3:50         ` Greg KH
@ 2009-02-03 15:01           ` Andrea Arcangeli
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-03 15:01 UTC (permalink / raw)
  To: Greg KH
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages, linux-man,
	linux-kernel

On Mon, Feb 02, 2009 at 07:50:12PM -0800, Greg KH wrote:
> On Mon, Feb 02, 2009 at 11:08:56PM +0100, Andrea Arcangeli wrote:
> > Hi Greg!
> > 
> > > Thanks for the pointers, I'll go read the thread and follow up there.
> > 
> > If you also run into this, the final fix is attached below. Porting to
> > mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> > notifiers to fix gup-fast... need to think more about it then I'll
> > post something.
> > 
> > Please help testing the below on pre-gup-fast kernels, thanks!
> 
> Thanks a lot for the patch, I'll try this out tomorrow.

Before testing, I recommend making this change:

-       if ((flags & FOLL_WRITE) && PageAnon(page) && !PageGUP(page))
+	if (PageAnon(page) && !PageGUP(page))

If gup triggers a COW, gup clears FOLL_WRITE from foll_flags before
calling follow_page again; I think that's why it didn't work right for
everyone, as it just couldn't work right if FOLL_WRITE was going away
before the last follow_page run. I wanted to set PG_gup only for
gup(write=1) and not for gup(write=0), but that optimization that
clears FOLL_WRITE prevents the above from working right, so the simple
solution is to always set PG_gup; it won't make a big difference.

I'm looking at the hugetlb cow before posting an update...


* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-03  4:38         ` KAMEZAWA Hiroyuki
@ 2009-02-03 15:08           ` Andrea Arcangeli
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-03 15:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Tue, Feb 03, 2009 at 01:38:11PM +0900, KAMEZAWA Hiroyuki wrote:
> From above, FOLL_WRITE can be dropped and PageGUP() will not be set ?

Yeah, this was an implementation problem!! If COW triggered, PG_gup
wasn't getting set as it should have been.

The only remaining bit is to check hugetlb pages...


* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-02 22:08       ` Andrea Arcangeli
                           ` (3 preceding siblings ...)
  2009-02-03  4:38         ` KAMEZAWA Hiroyuki
@ 2009-02-04 23:41         ` Greg KH
  2009-02-06 17:54           ` Andrea Arcangeli
  4 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2009-02-04 23:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages, linux-man,
	linux-kernel

On Mon, Feb 02, 2009 at 11:08:56PM +0100, Andrea Arcangeli wrote:
> Hi Greg!
> 
> > Thanks for the pointers, I'll go read the thread and follow up there.
> 
> > If you also run into this, the final fix is attached below. Porting to
> mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> notifiers to fix gup-fast... need to think more about it then I'll
> post something.
> 
> Please help testing the below on pre-gup-fast kernels, thanks!

What kernel version was this patch against?  I can't seem to apply it to
anything close to mainline :(

I tried 2.6.29-rc3, 2.6.28-stable and 2.6.27-stable.

thanks,

greg k-h


* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-04 23:41         ` Greg KH
@ 2009-02-06 17:54           ` Andrea Arcangeli
  2009-02-06 18:38             ` Andrea Arcangeli
  2009-02-07 13:32             ` Izik Eidus
  0 siblings, 2 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-06 17:54 UTC (permalink / raw)
  To: Greg KH
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages, linux-man,
	linux-kernel, Nick Piggin, Izik Eidus, Hugh Dickins

On Wed, Feb 04, 2009 at 03:41:53PM -0800, Greg KH wrote:
> On Mon, Feb 02, 2009 at 11:08:56PM +0100, Andrea Arcangeli wrote:
> > Hi Greg!
> > 
> > > Thanks for the pointers, I'll go read the thread and follow up there.
> > 
> > If you also run into this, the final fix is attached below. Porting to
> > mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> > notifiers to fix gup-fast... need to think more about it then I'll
> > post something.
> > 
> > Please help testing the below on pre-gup-fast kernels, thanks!
> 
> What kernel version was this patch against?  I can't seem to apply it to
> anything close to mainline :(
> 
> I tried 2.6.29-rc3, 2.6.28-stable and 2.6.27-stable.

Ah, the patch is against 2.6.18; it's a "production"-targeted patch, as this
race is hitting production systems. So this is the final update,
addressing hugetlb too. Kame was totally right: hugetlb has the same
bug, and I reproduced it trivially with:

LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes \
	HUGETLB_PATH=/mnt/huge/ ./dma_thread -a 512 -w 40
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes \
	HUGETLB_PATH=/mnt/huge/ ./forkscrew

I verified hugetlb pages were allocated with grep Huge /proc/meminfo, and
also checked, via a printk in the hugetlb 'forcecow' path, that it
ran and fixed the race. With the new patch the bug goes away for both
largepages and regular anon pages.

I also exercised the -EAGAIN path and verified it rolls back and restarts
on the previous pte just fine; it only happens during very heavy paging
activity. I unlock/relock around the page copy to be more low-latency
schedule friendly and to move the page allocation out of the fork fast
path.

Here is the latest patch. Now that I consider the 'production' trouble
closed, I'll be porting it to mainline while addressing gup-fast too,
which is an additional complication I haven't had to take care of
yet. So expect a patch that works for you in the next few days, either
that or complaints about an unfixable gup-fast ;). But frankly I've
been thinking it should be possible in a much simpler way than I
ever thought before, by relying entirely on the tlb flush.

In short, if I simply do the page-count vs page-mapcount check _after_
ptep_set_wrprotect (which implies a tlb flush, after which any
gup-fast running on other cpus should be forced through the slow path
and block), I think I'm done. The code now does:

    check mapcount
    ptep_set_wrprotect

I think that to make this work with gup-fast I only have to
ptep_set_wrprotect before the mapcount check.

The reason the normal pagetable walking of the cpu is ok with the
current code is that a ptep_set_wrprotect done later will block any
access to the page from the other cpus. Not so if it was gup-fast
taking a reference on the page. So we need ptep_set_wrprotect to stop
any subsequent gup-fast _before_ we check count vs mapcount, and the
fact that get_page runs inside the critical section with local irqs
disabled in gup-fast should close the race for gup-fast too. Will try
that ASAP...

With Hugh I'm also reviewing the old 2.4 bug where a thread was doing
COW on swapcache while another thread was starting O_DIRECT. As I recall,
to be 100% safe I wanted ptes pointing to pages under O_DIRECT
not to be unmapped by rmap; I did that in 2.4 with a PG_pinned flag telling
vmscan.c to bail out on the page. In 2.6 I wanted to do it with
mapcount, but I think some alternate fix went in, so while I'm at it
I'll review the correctness of that too, just in case ;).

--------
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: fork-o_direct-race

Think of a thread writing constantly to the last 512 bytes of a page, while
another thread reads and writes to/from the first 512 bytes of the page. We
can lose O_DIRECT reads (or any other get_user_pages write=1 I/O, not just
bio/O_DIRECT) the very moment we mark any pte wrprotected, because a third
unrelated thread forks off a child.

This fixes it by never wrprotecting anon ptes if there can be any direct I/O
in flight to the page, and by instantiating a readonly pte and triggering a
COW in the child. The only troublesome case here is O_DIRECT reads (writes
to memory, reads from disk). Checking the page_count under the PT lock
guarantees no get_user_pages could be running under us, because if somebody
wants to write to the page, it has to break any cow first, and that requires
taking the PT lock in follow_page before increasing the page count. We are
guaranteed mapcount is 1 if fork is writeprotecting the pte, so the PT lock
is enough to serialize against get_user_pages->get_page.

The COW triggered inside fork will run while the parent pte is readonly to
provide as usual the per-page atomic copy from parent to child during fork.
However timings will be altered by having to copy the pages that might be under
O_DIRECT.

The pagevec code calls get_page while the page is sitting in the pagevec
(before it becomes PageLRU), and in doing so it can generate false
positives. So, to avoid slowing down fork all the time even for pages that
could never possibly be under O_DIRECT write=1, the PG_gup bitflag is added;
this eliminates most of the fix's overhead in fork.

Patch doesn't break kABI despite introducing a new page flag.

Fixed version of original patch from Nick Piggin.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 63c8e4c..6105cad 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -86,6 +86,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_gup			20	/* Page pin may be because of gup */
 #define PG_xpmem		27	/* Testing for xpmem. */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
@@ -239,6 +240,10 @@
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
 
+#define SetPageGUP(page)	set_bit(PG_gup, &(page)->flags)
+#define PageGUP(page)		test_bit(PG_gup, &(page)->flags)
+#define __ClearPageGUP(page)	__clear_bit(PG_gup, &(page)->flags)
+
 /*
  * PG_reclaim is used in combination with PG_compound to mark the
  * head and tail of a compound page
diff --git a/kernel/fork.c b/kernel/fork.c
index f038aff..c29da4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -384,7 +384,7 @@ static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e80a916..f196ec8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -357,13 +357,17 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
 }
 
 
+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long address, pte_t *ptep, pte_t pte,
+		       int cannot_race);
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
-	pte_t *src_pte, *dst_pte, entry;
+	pte_t *src_pte, *dst_pte, entry, orig_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	int cow;
+	int cow, forcecow, oom;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
@@ -377,18 +381,46 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		/* if the page table is shared dont copy or take references */
 		if (dst_pte == src_pte)
 			continue;
+		oom = 0;
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
-		if (!huge_pte_none(huge_ptep_get(src_pte))) {
-			if (cow)
-				huge_ptep_set_wrprotect(src, addr, src_pte);
-			entry = huge_ptep_get(src_pte);
+		orig_entry = entry = huge_ptep_get(src_pte);
+		if (!huge_pte_none(entry)) {
+			forcecow = 0;
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			if (cow && pte_write(entry)) {
+				if (PageGUP(ptepage))
+					forcecow = 1;
+				huge_ptep_set_wrprotect(src, addr,
+							src_pte);
+				entry = huge_ptep_get(src_pte);
+			}
 			set_huge_pte_at(dst, addr, dst_pte, entry);
+			if (forcecow) {
+				int cow_ret;
+				/*
+				 * We hold mmap_sem in write mode and
+				 * the VM doesn't know about hugepages
+				 * so the src_pte/dst_pte can't change
+				 * from under us even if hugetlb_cow
+				 * will release the lock.
+				 */
+				cow_ret = hugetlb_cow(dst, vma, addr,
+						      dst_pte, entry, 1);
+				BUG_ON(!pte_same(huge_ptep_get(src_pte),
+						 entry));
+				set_huge_pte_at(src, addr,
+						src_pte,
+						orig_entry);
+				if (cow_ret != VM_FAULT_MINOR)
+					oom = 1;
+			}
 		}
 		spin_unlock(&src->page_table_lock);
 		spin_unlock(&dst->page_table_lock);
+		if (oom)
+			goto nomem;
 	}
 	return 0;
 
@@ -448,7 +480,8 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 }
 
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, pte_t pte)
+		       unsigned long address, pte_t *ptep, pte_t pte,
+		       int cannot_race)
 {
 	struct page *old_page, *new_page;
 	int avoidcopy;
@@ -483,7 +516,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 				make_huge_pte(vma, new_page, 1));
 		/* Make the old page be freed below */
 		new_page = old_page;
-	}
+	} else
+		BUG_ON(cannot_race);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return VM_FAULT_MINOR;
@@ -553,7 +587,7 @@ retry:
 
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, new_pte);
+		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, 0);
 	}
 
 	spin_unlock(&mm->page_table_lock);
@@ -600,7 +634,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
 		if (write_access && !pte_write(entry))
-			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+			ret = hugetlb_cow(mm, vma, address, ptep, entry, 0);
 	spin_unlock(&mm->page_table_lock);
 	mutex_unlock(&hugetlb_instantiation_mutex);
 
@@ -649,6 +683,8 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			get_page(page);
+			if (write && !PageGUP(page))
+				SetPageGUP(page);
 			pages[i] = page + pfn_offset;
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 12c2d9f..c85c854 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -426,7 +426,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -434,6 +434,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int forcecow = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -464,15 +465,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = *src_pte;
-	}
-
-	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
 	 */
@@ -484,13 +476,44 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
+		if (is_cow_mapping(vm_flags) && pte_write(pte) &&
+		    PageAnon(page) && PageGUP(page)) {
+			if (unlikely(TestSetPageLocked(page)))
+				forcecow = 1;
+			else {
+				BUG_ON(page_mapcount(page) != 2);
+				if (unlikely(page_count(page) !=
+					     page_mapcount(page)
+					     + !!PageSwapCache(page)))
+					forcecow = 1;
+				unlock_page(page);
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}
 
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child.
+	 */
+	if (is_cow_mapping(vm_flags)) {
+		if (pte_write(pte)) {
+			ptep_set_wrprotect(src_mm, addr, src_pte);
+			pte = pte_wrprotect(pte);
+		}
+	}
+
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return forcecow;
 }
 
+static int fork_pre_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address,
+			pte_t *src_pte, pte_t *dst_pte,
+			spinlock_t *src_ptl, spinlock_t *dst_ptl);
+
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
@@ -499,8 +522,10 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;
 
 again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -510,6 +535,9 @@ again:
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -525,15 +553,38 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(forcecow)) {
+		/*
+		 * Try to COW the child page, as direct I/O is working
+		 * on the parent page, and so we have to mark the parent
+		 * pte read-write before dropping the PT lock, to avoid
+		 * the page being cowed in the parent and any direct
+		 * I/O getting lost.
+		 */
+		forcecow = fork_pre_cow(dst_mm, vma, addr-PAGE_SIZE,
+					src_pte-1, dst_pte-1,
+					src_ptl, dst_ptl);
+		/* after the page copy set the parent pte writeable again */
+		set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1,
+			   pte_mkwrite(*(src_pte-1)));
+		if (unlikely(forcecow == -EAGAIN)) {
+			dst_pte--;
+			src_pte--;
+			addr -= PAGE_SIZE;
+		}
+	}
+
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (unlikely(forcecow == -ENOMEM))
+		return -ENOMEM;
 	if (addr != end)
 		goto again;
 	return 0;
@@ -957,8 +1008,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	if (unlikely(!page))
 		goto bad_page;
 
-	if (flags & FOLL_GET)
+	if (flags & FOLL_GET) {
 		get_page(page);
+		if (PageAnon(page) && !PageGUP(page))
+			SetPageGUP(page);
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1638,6 +1692,66 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 	copy_user_highpage(dst, src, va);
 }
 
+static int fork_pre_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address,
+			pte_t *src_pte, pte_t *dst_pte,
+			spinlock_t *src_ptl, spinlock_t *dst_ptl)
+{
+	pte_t _src_pte, _dst_pte;
+	struct page *old_page, *new_page;
+
+	_src_pte = *src_pte;
+	_dst_pte = *dst_pte;
+	old_page = vm_normal_page(vma, address, *src_pte);
+	BUG_ON(!old_page);
+	get_page(old_page);
+	spin_unlock(src_ptl);
+	spin_unlock(dst_ptl);
+
+	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+	if (!new_page)
+		return -ENOMEM;
+	cow_user_page(new_page, old_page, address);
+
+	spin_lock(dst_ptl);
+	spin_lock(src_ptl);
+
+	/*
+	 * The src pte can be unmapped by the VM from under us after
+	 * dropping the src_ptl, but it can't be cowed from under us, as
+	 * fork holds the mmap_sem in write mode.
+	 */
+	if (!pte_same(*src_pte, _src_pte))
+		goto eagain;
+	if (!pte_same(*dst_pte, _dst_pte))
+		goto eagain;
+
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+	page_cache_release(old_page);
+
+	flush_cache_page(vma, address, pte_pfn(*src_pte));
+	_dst_pte = mk_pte(new_page, vma->vm_page_prot);
+	_dst_pte = maybe_mkwrite(pte_mkdirty(_dst_pte), vma);
+	page_add_new_anon_rmap(new_page, vma, address);
+	lru_cache_add_active(new_page);
+	set_pte_at(mm, address, dst_pte, _dst_pte);
+	update_mmu_cache(vma, address, _dst_pte);
+	lazy_mmu_prot_update(_dst_pte);
+	return 0;
+
+eagain:
+	page_cache_release(old_page);
+	page_cache_release(new_page);
+	/*
+	 * Later we'll repeat the copy of this pte, so here we've to
+	 * undo the mapcount and page count taken in copy_one_pte.
+	 */
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+	return -EAGAIN;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3eed821..5bd4b7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -154,6 +154,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_gup |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -400,6 +401,8 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageGUP(page))
+		__ClearPageGUP(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -546,6 +549,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_gup |
 			1 << PG_buddy ))))
 		bad_page(page);
 



* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-03  2:55             ` KAMEZAWA Hiroyuki
  2009-02-03  3:42               ` KAMEZAWA Hiroyuki
@ 2009-02-06 17:55               ` Andrea Arcangeli
  1 sibling, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-06 17:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg KH, KOSAKI Motohiro, mtk.manpages, linux-man, linux-kernel

On Tue, Feb 03, 2009 at 11:55:40AM +0900, KAMEZAWA Hiroyuki wrote:
> No need to make a patch for copy_hugetlb_page_range() ?
> IMHO, HugeTLB can be write-protected at fork().

Yeah, that was entirely correct, thanks Kame! I reproduced it with
libhugetlbfs and verified it's fixed now.


* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-06 17:54           ` Andrea Arcangeli
@ 2009-02-06 18:38             ` Andrea Arcangeli
  2009-02-07 13:32             ` Izik Eidus
  1 sibling, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-06 18:38 UTC (permalink / raw)
  To: Greg KH
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages, linux-man,
	linux-kernel, Nick Piggin, Izik Eidus, Hugh Dickins

On Fri, Feb 06, 2009 at 06:54:14PM +0100, Andrea Arcangeli wrote:
> +	if (is_cow_mapping(vm_flags)) {
> +		if (pte_write(pte)) {
> +			ptep_set_wrprotect(src_mm, addr, src_pte);
> +			pte = pte_wrprotect(pte);
> +		}

While working on the mainline version that will definitely require a
tlb flush here if forcecow/PageGUP is set, I just realized to provide
an atomic per-page copy in the fork_pre_cow an explicit tlb flush is
needed in the pre-gup-fast version too.

	if (is_cow_mapping(vm_flags)) {
		if (pte_write(pte)) {
			ptep_set_wrprotect(src_mm, addr, src_pte);
			pte = pte_wrprotect(pte);
			if (forcecow)
				flush_tlb_page(src_vma, addr);
		}
	}

However, to flush the 'src_mm' I feel I can't pass the 'dst_vma' that
fork is passing to copy_page_range. OTOH the dst_vma is needed by
fork_pre_cow, which is why fork.c was changed in the
patch to pass dst_vma instead of src_vma.

So I think I want to avoid all further confusion about whether the 'vma'
belongs to the src_mm or the dst_mm by passing both src_vma and dst_vma
from fork to copy_page_range. In the pre-gup-fast version the tlb flush is
mostly a nitpick and would never lead to any practical issue, but
for mostly theoretical reasons it may be a good idea to have the
per-page atomic copy there too. Comments?


* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-06 17:54           ` Andrea Arcangeli
  2009-02-06 18:38             ` Andrea Arcangeli
@ 2009-02-07 13:32             ` Izik Eidus
  2009-02-07 15:33               ` Andrea Arcangeli
  1 sibling, 1 reply; 25+ messages in thread
From: Izik Eidus @ 2009-02-07 13:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Greg KH, KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages,
	linux-man, linux-kernel, Nick Piggin, Hugh Dickins

Andrea Arcangeli wrote:
>
> Here is the latest patch. Now that I consider the 'production' trouble
> closed, I'll be porting it to mainline while addressing gup-fast too,
> which is an additional complication I haven't had to take care of
> yet. So expect a patch that works for you in the next few days, either
> that or complaints about an unfixable gup-fast ;). But frankly I've
> been thinking it should be possible in a much simpler way than I
> ever thought before, by relying entirely on the tlb flush.
>
> In short, if I simply do the page-count vs page-mapcount check _after_
> ptep_set_wrprotect (which implies a tlb flush, after which any
> gup-fast running on other cpus should be forced through the slow path
> and block), I think I'm done. The code now does:
>
>     check mapcount
>     ptep_set_wrprotect
>
> I think that to make this work with gup-fast I only have to
> ptep_set_wrprotect before the mapcount check.
>
> The reason the normal pagetable walking of the cpu is ok with the
> current code is that a ptep_set_wrprotect done later will block any
> access to the page from the other cpus. Not so if it was gup-fast
> taking a reference on the page. So we need ptep_set_wrprotect to stop
> any subsequent gup-fast _before_ we check count vs mapcount, and the
> fact that get_page runs inside the critical section with local irqs
> disabled in gup-fast should close the race for gup-fast too. Will try
> that ASAP...
>
>   
Hi Andrea,
Good approach, but I think it isn't enough to just swap the order of the
mapcount check and ptep_set_wrprotect();
the reason is that gup_fast does the get_page() AFTER it has fetched the
pte, so we are opening a tiny race here:

cpu#1 does get_user_pages_fast and fetches the pte (it thinks the pte is
writeable)
cpu#2 does ptep_set_wrprotect()
cpu#2 checks the mapcount against the page count (it thinks that everything
is fine and continues)
cpu#1 only now does get_page()

Anyway, this is a minor issue that can probably be solved by just
rechecking that the pte isn't read-only in gup_fast after we do the
get_page().

Anyway, it sounds like a great idea to fix this issue!
Good work.




* Re: open(2) says O_DIRECT works on 512 byte boundries?
  2009-02-07 13:32             ` Izik Eidus
@ 2009-02-07 15:33               ` Andrea Arcangeli
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2009-02-07 15:33 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Greg KH, KOSAKI Motohiro, KAMEZAWA Hiroyuki, mtk.manpages,
	linux-man, linux-kernel, Nick Piggin, Hugh Dickins

On Sat, Feb 07, 2009 at 03:32:03PM +0200, Izik Eidus wrote:
> we are opening a tiny race here:
>
> cpu#1 does get_user_pages_fast and fetches the pte (it thinks the pte is
> writeable)
> cpu#2 does ptep_set_wrprotect()
> cpu#2 checks the mapcount against the page count (it thinks that everything
> is fine and continues)
> cpu#1 only now does get_page()
>
> Anyway, this is a minor issue that can probably be solved by just
> rechecking that the pte isn't read-only in gup_fast after we do the get_page()

Not needed: if I check page_count vs mapcount after marking the pte
readonly and after sending the smp tlb flush, there is no race.

> Anyway, it sounds like a great idea to fix this issue!

The only problem I'm thinking of now is the IPI flood that would be
generated if I sent IPIs for every pte wrprotected in fork; that
sounds like overkill. So to use the IPI fix I could have gup-fast take the
slow path the first time around if PG_gup isn't set, and then only the
second time take the lockless fast path, when PG_gup is already set (PG_gup
gets set by follow_page under the PT lock/mmap_sem read mode at
least). And fork/ksm would only send IPIs for PG_gup pages. However,
that would make gup-fast slow the first time it runs on an
anonymous/hugetlb page with the pte marked writeable, and I'm unsure if
that's ok.

Otherwise we have to return to the plan of the slightly more complicated
fix.


end of thread, other threads:[~2009-02-07 15:40 UTC | newest]

Thread overview: 25+ messages
2009-01-28 21:33 open(2) says O_DIRECT works on 512 byte boundries? Greg KH
2009-01-29  0:41 ` Robert Hancock
     [not found]   ` <20090129011758.GA26534@kroah.com>
2009-01-29  2:59     ` Michael Kerrisk
2009-01-29  3:13       ` Greg KH
2009-01-29 15:40         ` Jeff Moyer
2009-01-30  6:16           ` Greg KH
2009-01-29  5:13 ` KAMEZAWA Hiroyuki
2009-01-29  7:10   ` KOSAKI Motohiro
2009-01-30  6:17     ` Greg KH
2009-02-02 22:08       ` Andrea Arcangeli
2009-02-03  1:29         ` KAMEZAWA Hiroyuki
2009-02-03  2:31           ` Andrea Arcangeli
2009-02-03  2:55             ` KAMEZAWA Hiroyuki
2009-02-03  3:42               ` KAMEZAWA Hiroyuki
2009-02-06 17:55               ` Andrea Arcangeli
2009-02-03  3:50         ` Greg KH
2009-02-03 15:01           ` Andrea Arcangeli
2009-02-03  4:13         ` KAMEZAWA Hiroyuki
2009-02-03  4:38         ` KAMEZAWA Hiroyuki
2009-02-03 15:08           ` Andrea Arcangeli
2009-02-04 23:41         ` Greg KH
2009-02-06 17:54           ` Andrea Arcangeli
2009-02-06 18:38             ` Andrea Arcangeli
2009-02-07 13:32             ` Izik Eidus
2009-02-07 15:33               ` Andrea Arcangeli
