linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Describe race of direct read and fork for unaligned buffers
@ 2012-04-30  9:30 Jan Kara
  2012-04-30 13:41 ` Jeff Moyer
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Jan Kara @ 2012-04-30  9:30 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: LKML, linux-man, linux-mm, Jan Kara, mgorman, Jeff Moyer

This is a long standing problem (or a surprising feature) in our implementation
of get_user_pages() (used by direct IO). Since several attempts to fix it
failed (e.g.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
clear whether we really want to fix it given the costs, let's at least document
it.

CC: mgorman@suse.de
CC: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---

--- a/man2/open.2	2012-04-27 00:07:51.736883092 +0200
+++ b/man2/open.2	2012-04-27 00:29:59.489892980 +0200
@@ -769,7 +769,12 @@
 and the file offset must all be multiples of the logical block size
 of the file system.
 Under Linux 2.6, alignment to 512-byte boundaries
-suffices.
+suffices. However, if the user buffer is not page aligned and direct read
+runs in parallel with a
+.BR fork (2)
+of the reader process, it may happen that the read data is split between
+pages owned by the original process and its child. Thus effectively read
+data is corrupted.
 .LP
 The
 .B O_DIRECT

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
@ 2012-04-30 13:41 ` Jeff Moyer
  2012-04-30 14:30 ` Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Jeff Moyer @ 2012-04-30 13:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michael Kerrisk, LKML, linux-man, linux-mm, mgorman

Jan Kara <jack@suse.cz> writes:

> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman@suse.de
> CC: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>
> --- a/man2/open.2	2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2	2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
>  and the file offset must all be multiples of the logical block size
>  of the file system.
>  Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
>  .LP
>  The
>  .B O_DIRECT

I think this sufficiently distills the problem.  Thanks, Jan.

Acked-by: Jeff Moyer <jmoyer@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
  2012-04-30 13:41 ` Jeff Moyer
@ 2012-04-30 14:30 ` Mel Gorman
  2012-05-01  5:50 ` Michael Kerrisk (man-pages)
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2012-04-30 14:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michael Kerrisk, LKML, linux-man, linux-mm, Jeff Moyer

On Mon, Apr 30, 2012 at 11:30:07AM +0200, Jan Kara wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
> 
> CC: mgorman@suse.de
> CC: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
  2012-04-30 13:41 ` Jeff Moyer
  2012-04-30 14:30 ` Mel Gorman
@ 2012-05-01  5:50 ` Michael Kerrisk (man-pages)
  2012-05-01  6:49 ` Nick Piggin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Michael Kerrisk (man-pages) @ 2012-05-01  5:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-man, linux-mm, mgorman, Jeff Moyer

Jan,

On Mon, Apr 30, 2012 at 9:30 PM, Jan Kara <jack@suse.cz> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman@suse.de
> CC: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>
> --- a/man2/open.2       2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2       2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
>  and the file offset must all be multiples of the logical block size
>  of the file system.
>  Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
>  .LP
>  The
>  .B O_DIRECT

Thanks. I tweaked the patch slightly, and applied as below.

Cheers,

Michael

--- a/man2/open.2
+++ b/man2/open.2
@@ -49,7 +49,7 @@
 .\" FIXME Linux 2.6.33 has O_DSYNC, and a hidden __O_SYNC.
 .\" FIXME: Linux 2.6.39 added O_PATH
 .\"
-.TH OPEN 2 2012-02-27 "Linux" "Linux Programmer's Manual"
+.TH OPEN 2 2012-05-01 "Linux" "Linux Programmer's Manual"
 .SH NAME
 open, creat \- open and possibly create a file or device
 .SH SYNOPSIS
@@ -768,8 +768,13 @@ operation in
 Under Linux 2.4, transfer sizes, and the alignment of the user buffer
 and the file offset must all be multiples of the logical block size
 of the file system.
-Under Linux 2.6, alignment to 512-byte boundaries
-suffices.
+Under Linux 2.6, alignment to 512-byte boundaries suffices.
+However, if the user buffer is not page-aligned and the direct read
+runs in parallel with a
+.BR fork (2)
+of the reader process, it may happen that the read data is split between
+pages owned by the original process and its child.
+Thus the read data is effectively corrupted.
 .LP
 The
 .B O_DIRECT


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
                   ` (2 preceding siblings ...)
  2012-05-01  5:50 ` Michael Kerrisk (man-pages)
@ 2012-05-01  6:49 ` Nick Piggin
  2012-05-01 14:31 ` KOSAKI Motohiro
  2012-05-01 16:15 ` KOSAKI Motohiro
  5 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2012-05-01  6:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michael Kerrisk, LKML, linux-man, linux-mm, mgorman, Jeff Moyer

On 30 April 2012 19:30, Jan Kara <jack@suse.cz> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.

In any case, it should be documented even if it is ever fixed in newer
kernels. Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
                   ` (3 preceding siblings ...)
  2012-05-01  6:49 ` Nick Piggin
@ 2012-05-01 14:31 ` KOSAKI Motohiro
  2012-05-01 14:37   ` KOSAKI Motohiro
  2012-05-01 16:15 ` KOSAKI Motohiro
  5 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-01 14:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michael Kerrisk, LKML, linux-man, linux-mm, mgorman, Jeff Moyer, npiggin

On Mon, Apr 30, 2012 at 5:30 AM, Jan Kara <jack@suse.cz> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman@suse.de
> CC: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>
> --- a/man2/open.2       2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2       2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
>  and the file offset must all be multiples of the logical block size
>  of the file system.
>  Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
>  .LP
>  The
>  .B O_DIRECT

Hello,

Thank you revisit this. But as far as my remember is correct, this issue is NOT
unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
DIRECT_IO w/ multi thread process should not use fork().

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 14:31 ` KOSAKI Motohiro
@ 2012-05-01 14:37   ` KOSAKI Motohiro
  2012-05-01 15:11     ` Jeff Moyer
  0 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-01 14:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michael Kerrisk, LKML, linux-man, linux-mm, mgorman, Jeff Moyer, npiggin

> Hello,
>
> Thank you revisit this. But as far as my remember is correct, this issue is NOT
> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
> DIRECT_IO w/ multi thread process should not use fork().

The problem is, fork (and its COW logic) assume new access makes cow break,
But page table protection can't detect a DMA write. Therefore DIO may override
shared page data.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 14:37   ` KOSAKI Motohiro
@ 2012-05-01 15:11     ` Jeff Moyer
  2012-05-01 15:34       ` KOSAKI Motohiro
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff Moyer @ 2012-05-01 15:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Jan Kara, Michael Kerrisk, LKML, linux-man, linux-mm, mgorman, npiggin

KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:

>> Hello,
>>
>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>> DIRECT_IO w/ multi thread process should not use fork().
>
> The problem is, fork (and its COW logic) assume new access makes cow break,
> But page table protection can't detect a DMA write. Therefore DIO may override
> shared page data.

Hm, I've only seen this with misaligned or multiple sub-page-sized reads
in the same page.  AFAIR, aligned, page-sized I/O does not get split.
But, I could be wrong...

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 15:11     ` Jeff Moyer
@ 2012-05-01 15:34       ` KOSAKI Motohiro
  2012-05-01 15:38         ` Jeff Moyer
  0 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-01 15:34 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Michael Kerrisk, LKML, linux-man, linux-mm, mgorman, npiggin

On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
>
>>> Hello,
>>>
>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>> DIRECT_IO w/ multi thread process should not use fork().
>>
>> The problem is, fork (and its COW logic) assume new access makes cow break,
>> But page table protection can't detect a DMA write. Therefore DIO may override
>> shared page data.
>
> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
> But, I could be wrong...

If my remember is correct, the reproducer of past thread is misleading.

dma_thread.c in
http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
align parameter. But it doesn't only change align. Because of, every
worker thread read 4K (pagesize), then
 - when offset is page aligned
    -> every page is accessed from only one worker
 - when offset is not page aligned
    -> every page is accessed from two workers

But I don't remember why two threads are important things. hmm.. I'm
looking into the code a while.
Please don't 100% trust me.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 15:34       ` KOSAKI Motohiro
@ 2012-05-01 15:38         ` Jeff Moyer
  2012-05-01 15:50           ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff Moyer @ 2012-05-01 15:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Jan Kara, Michael Kerrisk, LKML, linux-man, linux-mm, mgorman,
	npiggin, Andrea Arcangeli, Woodman

KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:

> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
>>
>>>> Hello,
>>>>
>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>>> DIRECT_IO w/ multi thread process should not use fork().
>>>
>>> The problem is, fork (and its COW logic) assume new access makes cow break,
>>> But page table protection can't detect a DMA write. Therefore DIO may override
>>> shared page data.
>>
>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
>> But, I could be wrong...
>
> If my remember is correct, the reproducer of past thread is misleading.
>
> dma_thread.c in
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
> align parameter. But it doesn't only change align. Because of, every
> worker thread read 4K (pagesize), then
>  - when offset is page aligned
>     -> every page is accessed from only one worker
>  - when offset is not page aligned
>     -> every page is accessed from two workers
>
> But I don't remember why two threads are important things. hmm.. I'm
> looking into the code a while.
> Please don't 100% trust me.

I bet Andrea or Larry would remember the details.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 15:38         ` Jeff Moyer
@ 2012-05-01 15:50           ` Nick Piggin
  2012-05-01 23:51             ` Andrea Arcangeli
  2012-05-02  8:17             ` Jan Kara
  0 siblings, 2 replies; 31+ messages in thread
From: Nick Piggin @ 2012-05-01 15:50 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: KOSAKI Motohiro, Jan Kara, Michael Kerrisk, LKML, linux-man,
	linux-mm, mgorman, Andrea Arcangeli, Woodman

On 2 May 2012 01:38, Jeff Moyer <jmoyer@redhat.com> wrote:
> KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
>
>> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
>>> KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
>>>
>>>>> Hello,
>>>>>
>>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>>>> DIRECT_IO w/ multi thread process should not use fork().
>>>>
>>>> The problem is, fork (and its COW logic) assume new access makes cow break,
>>>> But page table protection can't detect a DMA write. Therefore DIO may override
>>>> shared page data.
>>>
>>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
>>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
>>> But, I could be wrong...
>>
>> If my remember is correct, the reproducer of past thread is misleading.
>>
>> dma_thread.c in
>> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
>> align parameter. But it doesn't only change align. Because of, every
>> worker thread read 4K (pagesize), then
>>  - when offset is page aligned
>>     -> every page is accessed from only one worker
>>  - when offset is not page aligned
>>     -> every page is accessed from two workers
>>
>> But I don't remember why two threads are important things. hmm.. I'm
>> looking into the code a while.
>> Please don't 100% trust me.
>
> I bet Andrea or Larry would remember the details.

KOSAKI-san is correct, I think.

The race is something like this:

DIO-read
    page = get_user_pages()
                                                        fork()
                                                            COW(page)
                                                         touch(page)
    DMA(page)
    page_cache_release(page);

So whether parent or child touches the page, determines who gets the
actual DMA target, and who gets the copy.

2 threads are not required, but it makes the race easier to code and a
larger window, I suspect.

It can also be hit with a single thread, using AIO.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
                   ` (4 preceding siblings ...)
  2012-05-01 14:31 ` KOSAKI Motohiro
@ 2012-05-01 16:15 ` KOSAKI Motohiro
  2012-05-01 17:56   ` Michael Kerrisk (man-pages)
  5 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-01 16:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michael Kerrisk, LKML, linux-man, linux-mm, mgorman, Jeff Moyer

> +suffices. However, if the user buffer is not page aligned and direct read

One more thing. direct write also makes data corruption. Think
following scenario,

1) P1-T1 uses DIO write (and starting dma)
2) P1-T2 call fork() and makes P2
3) P1-T3 write to the dio target page. and then, cow break occur and
original dio target
    pages is now owned by P2.
4) P2 write the dio target page. It now does NOT make cow break. and
now we break
    dio target page data.
5) DMA transfer write invalid data to disk.

The detail is described in your refer URLs.


> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
>  .LP
>  The
>  .B O_DIRECT

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 16:15 ` KOSAKI Motohiro
@ 2012-05-01 17:56   ` Michael Kerrisk (man-pages)
  2012-05-02  0:34     ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Michael Kerrisk (man-pages) @ 2012-05-01 17:56 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Jan Kara, LKML, linux-man, linux-mm, mgorman, Jeff Moyer

On Wed, May 2, 2012 at 4:15 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
>> +suffices. However, if the user buffer is not page aligned and direct read
>
> One more thing. direct write also makes data corruption. Think
> following scenario,

In the light of all of the comments, can someone revise the man-pages
patch that Jan sent?

Thanks,

Michael


> 1) P1-T1 uses DIO write (and starting dma)
> 2) P1-T2 call fork() and makes P2
> 3) P1-T3 write to the dio target page. and then, cow break occur and
> original dio target
>    pages is now owned by P2.
> 4) P2 write the dio target page. It now does NOT make cow break. and
> now we break
>    dio target page data.
> 5) DMA transfer write invalid data to disk.
>
> The detail is described in your refer URLs.
>
>
>> +runs in parallel with a
>> +.BR fork (2)
>> +of the reader process, it may happen that the read data is split between
>> +pages owned by the original process and its child. Thus effectively read
>> +data is corrupted.
>>  .LP
>>  The
>>  .B O_DIRECT



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 15:50           ` Nick Piggin
@ 2012-05-01 23:51             ` Andrea Arcangeli
  2012-05-02  8:17             ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2012-05-01 23:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Moyer, KOSAKI Motohiro, Jan Kara, Michael Kerrisk, LKML,
	linux-man, linux-mm, mgorman, Woodman

Hi Nick!

On Wed, May 02, 2012 at 01:50:46AM +1000, Nick Piggin wrote:
> KOSAKI-san is correct, I think.
> 
> The race is something like this:
> 
> DIO-read
>     page = get_user_pages()
>                                                         fork()
>                                                             COW(page)
>                                                          touch(page)
>     DMA(page)
>     page_cache_release(page);

Yes. More in general this race happens every time the kernel wrprotect
a writable anon pte, if get_user_pages had a pin on the page while the
pte is being wrprotected.

fork can't just abort (like KSM does) when it notices mapcount <
page_count.

The only way to avoid this, is that somehow the GUP-pinned page should
remain pointed at all times by the pte of the process that pinned the
page (no matter the cows), and that's not happening.

> So whether parent or child touches the page, determines who gets the
> actual DMA target, and who gets the copy.

Correct, so far there are two reproducers, triggering two different
kind of corruption.

The corruption may appear in different ways:

1) we could lose the direct-io read in the parent (if the forked child
does nothing and just quits), that was the basic case in dma_thread.c,
a dummy fork was run just to mark the pte wrprotected

2) the destination of the direct-io read may also become visible to the
child if the child written to the page before the I/O is complete,
leading to random mm corruption in the child

3) it's a direct-io write, then the child could write random data to
disk by accident without noticing, if the DMA wasn't started yet and
the child got the pinned page mapped in the child pte

We had two working fixes for this and personally I'd prefer to apply
them than to document the bug. The probability that who writes code
that can hit the bug is reading the note in the manpage seems pretty
small, especially in the short/mid term. This lkml thread as reminder
may actually have higher chance of being noticed than the manpage
maybe. Nevertheless documenting it is better than nothing if the fixes
aren't applied :). However I'm afraid after we officially document it
the chances of fixing it becomes zero.

> 2 threads are not required, but it makes the race easier to code and a
> larger window, I suspect.
> 
> It can also be hit with a single thread, using AIO.

Yes, it requires running fork in the same process that pinned a page
with GUP, and then writing to a buffer in the same page that is under
the GUP pin before the GUP pin is released.

It's not just direct-io, and not just direct-io read (see point 3).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 17:56   ` Michael Kerrisk (man-pages)
@ 2012-05-02  0:34     ` Nick Piggin
  2012-05-02  3:04       ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2012-05-02  0:34 UTC (permalink / raw)
  To: mtk.manpages
  Cc: KOSAKI Motohiro, Jan Kara, LKML, linux-man, linux-mm, mgorman,
	Jeff Moyer

On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On Wed, May 2, 2012 at 4:15 AM, KOSAKI Motohiro
> <kosaki.motohiro@gmail.com> wrote:
>>> +suffices. However, if the user buffer is not page aligned and direct read
>>
>> One more thing. direct write also makes data corruption. Think
>> following scenario,
>
> In the light of all of the comments, can someone revise the man-pages
> patch that Jan sent?

This does not quite describe the entire situation, but something understandable
to developers:

O_DIRECT IOs should never be run concurrently with fork(2) system call,
when the memory buffer is anonymous memory, or comes from mmap(2)
with MAP_PRIVATE.

Any such IOs, whether submitted with asynchronous IO interface or from
another thread in the process, should be quiesced before fork(2) is called.
Failure to do so can result in data corruption and undefined behavior in
parent and child processes.

This restriction does not apply when the memory buffer for the O_DIRECT
IOs comes from mmap(2) with MAP_SHARED or from shmat(2).



Is that on the right track? I feel it might be necessary to describe this
allowance for MAP_SHARED, because some databases may be doing
such things, and anyway it gives apps a potential way to make this work
if concurrent fork + DIO is very important.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02  0:34     ` Nick Piggin
@ 2012-05-02  3:04       ` Hugh Dickins
  2012-05-02  3:10         ` Nick Piggin
  2012-05-02  9:20         ` Jan Kara
  0 siblings, 2 replies; 31+ messages in thread
From: Hugh Dickins @ 2012-05-02  3:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: mtk.manpages, KOSAKI Motohiro, Jan Kara, LKML, linux-man,
	linux-mm, mgorman, Jeff Moyer

On Wed, 2 May 2012, Nick Piggin wrote:
> On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >
> > In the light of all of the comments, can someone revise the man-pages
> > patch that Jan sent?
> 
> This does not quite describe the entire situation, but something understandable
> to developers:
> 
> O_DIRECT IOs should never be run concurrently with fork(2) system call,
> when the memory buffer is anonymous memory, or comes from mmap(2)
> with MAP_PRIVATE.
> 
> Any such IOs, whether submitted with asynchronous IO interface or from
> another thread in the process, should be quiesced before fork(2) is called.
> Failure to do so can result in data corruption and undefined behavior in
> parent and child processes.
> 
> This restriction does not apply when the memory buffer for the O_DIRECT
> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).

Nor does this restriction apply when the memory buffer has been advised
as MADV_DONTFORK with madvise(2), ensuring that it will not be available
to the child after fork(2).

> 
> 
> 
> Is that on the right track? I feel it might be necessary to describe this
> allowance for MAP_SHARED, because some databases may be doing
> such things, and anyway it gives apps a potential way to make this work
> if concurrent fork + DIO is very important.

Looks good, but we do need a reference to MADV_DONTFORK, perhaps as above.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02  3:04       ` Hugh Dickins
@ 2012-05-02  3:10         ` Nick Piggin
  2012-05-02  9:20         ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2012-05-02  3:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: mtk.manpages, KOSAKI Motohiro, Jan Kara, LKML, linux-man,
	linux-mm, mgorman, Jeff Moyer

On 2 May 2012 13:04, Hugh Dickins <hughd@google.com> wrote:
> On Wed, 2 May 2012, Nick Piggin wrote:
>> On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> >
>> > In the light of all of the comments, can someone revise the man-pages
>> > patch that Jan sent?
>>
>> This does not quite describe the entire situation, but something understandable
>> to developers:
>>
>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>> when the memory buffer is anonymous memory, or comes from mmap(2)
>> with MAP_PRIVATE.
>>
>> Any such IOs, whether submitted with asynchronous IO interface or from
>> another thread in the process, should be quiesced before fork(2) is called.
>> Failure to do so can result in data corruption and undefined behavior in
>> parent and child processes.
>>
>> This restriction does not apply when the memory buffer for the O_DIRECT
>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>
> Nor does this restriction apply when the memory buffer has been advised
> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> to the child after fork(2).

Yes of course, I forgot that was exported too.

>
>>
>>
>>
>> Is that on the right track? I feel it might be necessary to describe this
>> allowance for MAP_SHARED, because some databases may be doing
>> such things, and anyway it gives apps a potential way to make this work
>> if concurrent fork + DIO is very important.
>
> Looks good, but we do need a reference to MADV_DONTFORK, perhaps as above.

Yep, thanks Hugh.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-01 15:50           ` Nick Piggin
  2012-05-01 23:51             ` Andrea Arcangeli
@ 2012-05-02  8:17             ` Jan Kara
  2012-05-02  9:09               ` Nick Piggin
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Kara @ 2012-05-02  8:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeff Moyer, KOSAKI Motohiro, Jan Kara, Michael Kerrisk, LKML,
	linux-man, linux-mm, mgorman, Andrea Arcangeli, Woodman

On Wed 02-05-12 01:50:46, Nick Piggin wrote:
> On 2 May 2012 01:38, Jeff Moyer <jmoyer@redhat.com> wrote:
> > KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
> >
> >> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> >>> KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:
> >>>
> >>>>> Hello,
> >>>>>
> >>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
> >>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
> >>>>> DIRECT_IO w/ multi thread process should not use fork().
> >>>>
> >>>> The problem is, fork (and its COW logic) assume new access makes cow break,
> >>>> But page table protection can't detect a DMA write. Therefore DIO may override
> >>>> shared page data.
> >>>
> >>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
> >>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
> >>> But, I could be wrong...
> >>
> >> If my remember is correct, the reproducer of past thread is misleading.
> >>
> >> dma_thread.c in
> >> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
> >> align parameter. But it doesn't only change align. Because of, every
> >> worker thread read 4K (pagesize), then
> >>  - when offset is page aligned
> >>     -> every page is accessed from only one worker
> >>  - when offset is not page aligned
> >>     -> every page is accessed from two workers
> >>
> >> But I don't remember why two threads are important things. hmm.. I'm
> >> looking into the code a while.
> >> Please don't 100% trust me.
> >
> > I bet Andrea or Larry would remember the details.
> 
> KOSAKI-san is correct, I think.
> 
> The race is something like this:
> 
> DIO-read
>     page = get_user_pages()
>                                                         fork()
>                                                             COW(page)
>                                                          touch(page)
>     DMA(page)
>     page_cache_release(page);
> 
> So whether parent or child touches the page, determines who gets the
> actual DMA target, and who gets the copy.
  OK, this is roughly what I understood from original threads as well. So
if our buffer is page aligned and its size is page aligned, you would hit
the corruption only if you do modify the buffer while IO to / from that buffer
is in progress. And that would seem like a really bad programming practice
anyway. So I still believe that having everything page size aligned will
effectively remove the problem although I agree it does not aim at the core
of it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02  8:17             ` Jan Kara
@ 2012-05-02  9:09               ` Nick Piggin
  2012-05-02  9:18                 ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2012-05-02  9:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, KOSAKI Motohiro, Michael Kerrisk, LKML, linux-man,
	linux-mm, mgorman, Andrea Arcangeli, Woodman

On 2 May 2012 18:17, Jan Kara <jack@suse.cz> wrote:
> On Wed 02-05-12 01:50:46, Nick Piggin wrote:

>> KOSAKI-san is correct, I think.
>>
>> The race is something like this:
>>
>> DIO-read
>>     page = get_user_pages()
>>                                                         fork()
>>                                                             COW(page)
>>                                                          touch(page)
>>     DMA(page)
>>     page_cache_release(page);
>>
>> So whether parent or child touches the page, determines who gets the
>> actual DMA target, and who gets the copy.
>  OK, this is roughly what I understood from original threads as well. So
> if our buffer is page aligned and its size is page aligned, you would hit
> the corruption only if you do modify the buffer while IO to / from that buffer
> is in progress. And that would seem like a really bad programming practice
> anyway. So I still believe that having everything page size aligned will
> effectively remove the problem although I agree it does not aim at the core
> of it.

I see what you mean.

I'm not sure, though. For most apps it's bad practice I think. If you get into
realm of sophisticated, performance critical IO/storage managers, it would
not surprise me if such concurrent buffer modifications could be allowed.
We allow exactly such a thing in our pagecache layer. Although probably
those would be using shared mmaps for their buffer cache.

I think it is safest to make a default policy of asking for IOs against private
cow-able mappings to be quiesced before fork, so there are no surprises
or reliance on COW details in the mm. Do you think?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02  9:09               ` Nick Piggin
@ 2012-05-02  9:18                 ` Jan Kara
  2012-05-02 19:14                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2012-05-02  9:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Jeff Moyer, KOSAKI Motohiro, Michael Kerrisk, LKML,
	linux-man, linux-mm, mgorman, Andrea Arcangeli, Woodman

On Wed 02-05-12 19:09:54, Nick Piggin wrote:
> On 2 May 2012 18:17, Jan Kara <jack@suse.cz> wrote:
> > On Wed 02-05-12 01:50:46, Nick Piggin wrote:
> 
> >> KOSAKI-san is correct, I think.
> >>
> >> The race is something like this:
> >>
> >> DIO-read
> >>     page = get_user_pages()
> >>                                                         fork()
> >>                                                             COW(page)
> >>                                                          touch(page)
> >>     DMA(page)
> >>     page_cache_release(page);
> >>
> >> So whether parent or child touches the page, determines who gets the
> >> actual DMA target, and who gets the copy.
> >  OK, this is roughly what I understood from original threads as well. So
> > if our buffer is page aligned and its size is page aligned, you would hit
> > the corruption only if you do modify the buffer while IO to / from that buffer
> > is in progress. And that would seem like a really bad programming practice
> > anyway. So I still believe that having everything page size aligned will
> > effectively remove the problem although I agree it does not aim at the core
> > of it.
> 
> I see what you mean.
> 
> I'm not sure, though. For most apps it's bad practice I think. If you get into
> realm of sophisticated, performance critical IO/storage managers, it would
> not surprise me if such concurrent buffer modifications could be allowed.
> We allow exactly such a thing in our pagecache layer. Although probably
> those would be using shared mmaps for their buffer cache.
> 
> I think it is safest to make a default policy of asking for IOs against private
> cow-able mappings to be quiesced before fork, so there are no surprises
> or reliance on COW details in the mm. Do you think?
  Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
in documentation. Otherwise it's a bit too hairy...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02  3:04       ` Hugh Dickins
  2012-05-02  3:10         ` Nick Piggin
@ 2012-05-02  9:20         ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Kara @ 2012-05-02  9:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, mtk.manpages, KOSAKI Motohiro, Jan Kara, LKML,
	linux-man, linux-mm, mgorman, Jeff Moyer

On Tue 01-05-12 20:04:15, Hugh Dickins wrote:
> On Wed, 2 May 2012, Nick Piggin wrote:
> > On 2 May 2012 03:56, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> > >
> > > In the light of all of the comments, can someone revise the man-pages
> > > patch that Jan sent?
> > 
> > This does not quite describe the entire situation, but something understandable
> > to developers:
> > 
> > O_DIRECT IOs should never be run concurrently with fork(2) system call,
> > when the memory buffer is anonymous memory, or comes from mmap(2)
> > with MAP_PRIVATE.
> > 
> > Any such IOs, whether submitted with asynchronous IO interface or from
> > another thread in the process, should be quiesced before fork(2) is called.
> > Failure to do so can result in data corruption and undefined behavior in
> > parent and child processes.
> > 
> > This restriction does not apply when the memory buffer for the O_DIRECT
> > IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
> 
> Nor does this restriction apply when the memory buffer has been advised
> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> to the child after fork(2).
  Yes, I think with this addition the text is fine.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02  9:18                 ` Jan Kara
@ 2012-05-02 19:14                   ` KOSAKI Motohiro
  2012-05-02 19:23                     ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-02 19:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Jeff Moyer, KOSAKI Motohiro, Michael Kerrisk, LKML,
	linux-man, linux-mm, mgorman, Andrea Arcangeli, Woodman

Hello,

>> I see what you mean.
>>
>> I'm not sure, though. For most apps it's bad practice I think. If you get into
>> realm of sophisticated, performance critical IO/storage managers, it would
>> not surprise me if such concurrent buffer modifications could be allowed.
>> We allow exactly such a thing in our pagecache layer. Although probably
>> those would be using shared mmaps for their buffer cache.
>>
>> I think it is safest to make a default policy of asking for IOs against private
>> cow-able mappings to be quiesced before fork, so there are no surprises
>> or reliance on COW details in the mm. Do you think?
>    Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
> in documentation. Otherwise it's a bit too hairy...

I neglected this issue for years because Linus asked who need this and
I couldn't
find real world usecase.

Ah, no, not exactly correct. Fujitsu proprietary database had such
usecase. But they
quickly fixed it. Then I couldn't find alternative usecase.

I'm not sure why you say "hairy". Do you mean you have any use case of this?

Thank you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02 19:14                   ` KOSAKI Motohiro
@ 2012-05-02 19:23                     ` Jan Kara
  2012-05-02 19:25                       ` KOSAKI Motohiro
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2012-05-02 19:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Jan Kara, Nick Piggin, Jeff Moyer, Michael Kerrisk, LKML,
	linux-man, linux-mm, mgorman, Andrea Arcangeli, Woodman

On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
> Hello,
> 
> >> I see what you mean.
> >>
> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
> >> realm of sophisticated, performance critical IO/storage managers, it would
> >> not surprise me if such concurrent buffer modifications could be allowed.
> >> We allow exactly such a thing in our pagecache layer. Although probably
> >> those would be using shared mmaps for their buffer cache.
> >>
> >> I think it is safest to make a default policy of asking for IOs against private
> >> cow-able mappings to be quiesced before fork, so there are no surprises
> >> or reliance on COW details in the mm. Do you think?
> >    Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
> > in documentation. Otherwise it's a bit too hairy...
> 
> I neglected this issue for years because Linus asked who need this and
> I couldn't
> find real world usecase.
> 
> Ah, no, not exactly correct. Fujitsu proprietary database had such
> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
  One of our customers hit this bug recently which is why I started to look
at this. But they also modified their application not to hit the problem.

> I'm not sure why you say "hairy". Do you mean you have any use case of this?
  I meant that if we should describe conditions like "if you have page
aligned buffer and you don't write to it while the IO is running, the
problem also won't occur", then it's already too detailed and might
easily change in future kernels...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02 19:23                     ` Jan Kara
@ 2012-05-02 19:25                       ` KOSAKI Motohiro
  2012-05-05 11:28                         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-02 19:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Jeff Moyer, Michael Kerrisk, LKML, linux-man,
	linux-mm, mgorman, Andrea Arcangeli, Woodman

On Wed, May 2, 2012 at 3:23 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
>> Hello,
>>
>> >> I see what you mean.
>> >>
>> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
>> >> realm of sophisticated, performance critical IO/storage managers, it would
>> >> not surprise me if such concurrent buffer modifications could be allowed.
>> >> We allow exactly such a thing in our pagecache layer. Although probably
>> >> those would be using shared mmaps for their buffer cache.
>> >>
>> >> I think it is safest to make a default policy of asking for IOs against private
>> >> cow-able mappings to be quiesced before fork, so there are no surprises
>> >> or reliance on COW details in the mm. Do you think?
>> >    Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
>> > in documentation. Otherwise it's a bit too hairy...
>>
>> I neglected this issue for years because Linus asked who need this and
>> I couldn't
>> find real world usecase.
>>
>> Ah, no, not exactly correct. Fujitsu proprietary database had such
>> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
>  One of our customers hit this bug recently which is why I started to look
> at this. But they also modified their application not to hit the problem.
>
>> I'm not sure why you say "hairy". Do you mean you have any use case of this?
>  I meant that if we should describe conditions like "if you have page
> aligned buffer and you don't write to it while the IO is running, the
> problem also won't occur", then it's already too detailed and might
> easily change in future kernels...

ok, thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-02 19:25                       ` KOSAKI Motohiro
@ 2012-05-05 11:28                         ` Michael Kerrisk (man-pages)
  2012-05-05 15:29                           ` KOSAKI Motohiro
  0 siblings, 1 reply; 31+ messages in thread
From: Michael Kerrisk (man-pages) @ 2012-05-05 11:28 UTC (permalink / raw)
  To: KOSAKI Motohiro, Nick Piggin
  Cc: Jan Kara, Jeff Moyer, LKML, linux-man, linux-mm, mgorman,
	Andrea Arcangeli, Woodman

On Thu, May 3, 2012 at 7:25 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
> On Wed, May 2, 2012 at 3:23 PM, Jan Kara <jack@suse.cz> wrote:
>> On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
>>> Hello,
>>>
>>> >> I see what you mean.
>>> >>
>>> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
>>> >> realm of sophisticated, performance critical IO/storage managers, it would
>>> >> not surprise me if such concurrent buffer modifications could be allowed.
>>> >> We allow exactly such a thing in our pagecache layer. Although probably
>>> >> those would be using shared mmaps for their buffer cache.
>>> >>
>>> >> I think it is safest to make a default policy of asking for IOs against private
>>> >> cow-able mappings to be quiesced before fork, so there are no surprises
>>> >> or reliance on COW details in the mm. Do you think?
>>> >    Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
>>> > in documentation. Otherwise it's a bit too hairy...
>>>
>>> I neglected this issue for years because Linus asked who need this and
>>> I couldn't
>>> find real world usecase.
>>>
>>> Ah, no, not exactly correct. Fujitsu proprietary database had such
>>> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
>>  One of our customers hit this bug recently which is why I started to look
>> at this. But they also modified their application not to hit the problem.
>>
>>> I'm not sure why you say "hairy". Do you mean you have any use case of this?
>>  I meant that if we should describe conditions like "if you have page
>> aligned buffer and you don't write to it while the IO is running, the
>> problem also won't occur", then it's already too detailed and might
>> easily change in future kernels...

So, am I correct to assume that right text to add to the page is as below?

Nick, can you clarify what you mean by "quiesced"?

[[
O_DIRECT IOs should never be run concurrently with fork(2) system call,
when the memory buffer is anonymous memory, or comes from mmap(2)
with MAP_PRIVATE.

Any such IOs, whether submitted with asynchronous IO interface or from
another thread in the process, should be quiesced before fork(2) is called.
Failure to do so can result in data corruption and undefined behavior in
parent and child processes.

This restriction does not apply when the memory buffer for the O_DIRECT
IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
Nor does this restriction apply when the memory buffer has been advised
as MADV_DONTFORK with madvise(2), ensuring that it will not be available
to the child after fork(2).
]]

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-05 11:28                         ` Michael Kerrisk (man-pages)
@ 2012-05-05 15:29                           ` KOSAKI Motohiro
  2012-05-08 23:10                             ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: KOSAKI Motohiro @ 2012-05-05 15:29 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Nick Piggin, Jan Kara, Jeff Moyer, LKML, linux-man, linux-mm,
	mgorman, Andrea Arcangeli, Woodman

> So, am I correct to assume that right text to add to the page is as below?
>
> Nick, can you clarify what you mean by "quiesced"?

finished?

>
> [[
> O_DIRECT IOs should never be run concurrently with fork(2) system call,
> when the memory buffer is anonymous memory, or comes from mmap(2)
> with MAP_PRIVATE.
>
> Any such IOs, whether submitted with asynchronous IO interface or from
> another thread in the process, should be quiesced before fork(2) is called.
> Failure to do so can result in data corruption and undefined behavior in
> parent and child processes.
>
> This restriction does not apply when the memory buffer for the O_DIRECT
> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
> Nor does this restriction apply when the memory buffer has been advised
> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> to the child after fork(2).
> ]]

I don't have good English and I can't make editorial check. But at least,
I don't find any technical incorrect explanation here.

 Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-05 15:29                           ` KOSAKI Motohiro
@ 2012-05-08 23:10                             ` Nick Piggin
  2012-05-09  5:35                               ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2012-05-08 23:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: mtk.manpages, Jan Kara, Jeff Moyer, LKML, linux-man, linux-mm,
	mgorman, Andrea Arcangeli, Woodman

On 6 May 2012 01:29, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
>> So, am I correct to assume that right text to add to the page is as below?
>>
>> Nick, can you clarify what you mean by "quiesced"?
>
> finished?

Yes exactly. That might be a simpler word. Thanks!

>
>>
>> [[
>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>> when the memory buffer is anonymous memory, or comes from mmap(2)
>> with MAP_PRIVATE.
>>
>> Any such IOs, whether submitted with asynchronous IO interface or from
>> another thread in the process, should be quiesced before fork(2) is called.
>> Failure to do so can result in data corruption and undefined behavior in
>> parent and child processes.
>>
>> This restriction does not apply when the memory buffer for the O_DIRECT
>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>> Nor does this restriction apply when the memory buffer has been advised
>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>> to the child after fork(2).
>> ]]
>
> I don't have good English and I can't make editorial check. But at least,
> I don't find any technical incorrect explanation here.
>
>  Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-08 23:10                             ` Nick Piggin
@ 2012-05-09  5:35                               ` Michael Kerrisk (man-pages)
  2012-05-09  7:01                                 ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Michael Kerrisk (man-pages) @ 2012-05-09  5:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Jan Kara, Jeff Moyer, LKML, linux-man, linux-mm,
	mgorman, Andrea Arcangeli, Woodman

On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <npiggin@gmail.com> wrote:
> On 6 May 2012 01:29, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
>>> So, am I correct to assume that right text to add to the page is as below?
>>>
>>> Nick, can you clarify what you mean by "quiesced"?
>>
>> finished?
>
> Yes exactly. That might be a simpler word. Thanks!

Thanks.

But see below. I realize the text is still ambiguous.

>>> [[
>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>>> when the memory buffer is anonymous memory, or comes from mmap(2)
>>> with MAP_PRIVATE.
>>>
>>> Any such IOs, whether submitted with asynchronous IO interface or from
>>> another thread in the process, should be quiesced before fork(2) is called.
>>> Failure to do so can result in data corruption and undefined behavior in
>>> parent and child processes.
>>>
>>> This restriction does not apply when the memory buffer for the O_DIRECT
>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>>> Nor does this restriction apply when the memory buffer has been advised
>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>>> to the child after fork(2).
>>> ]]

In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
unclear. The first paragraph implies that such a buffer is unsafe,
while the third paragraph implies that it *is* safe, thus
contradicting the first paragraph. Which is correct?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-09  5:35                               ` Michael Kerrisk (man-pages)
@ 2012-05-09  7:01                                 ` Nick Piggin
  2012-05-09  7:18                                   ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2012-05-09  7:01 UTC (permalink / raw)
  To: mtk.manpages
  Cc: KOSAKI Motohiro, Jan Kara, Jeff Moyer, LKML, linux-man, linux-mm,
	mgorman, Andrea Arcangeli, Woodman

On 9 May 2012 15:35, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <npiggin@gmail.com> wrote:
>> On 6 May 2012 01:29, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
>>>> So, am I correct to assume that right text to add to the page is as below?
>>>>
>>>> Nick, can you clarify what you mean by "quiesced"?
>>>
>>> finished?
>>
>> Yes exactly. That might be a simpler word. Thanks!
>
> Thanks.
>
> But see below. I realize the text is still ambiguous.
>
>>>> [[
>>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>>>> when the memory buffer is anonymous memory, or comes from mmap(2)
>>>> with MAP_PRIVATE.
>>>>
>>>> Any such IOs, whether submitted with asynchronous IO interface or from
>>>> another thread in the process, should be quiesced before fork(2) is called.
>>>> Failure to do so can result in data corruption and undefined behavior in
>>>> parent and child processes.
>>>>
>>>> This restriction does not apply when the memory buffer for the O_DIRECT
>>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>>>> Nor does this restriction apply when the memory buffer has been advised
>>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>>>> to the child after fork(2).
>>>> ]]
>
> In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
> unclear. The first paragraph implies that such a buffer is unsafe,
> while the third paragraph implies that it *is* safe, thus
> contradicting the first paragraph. Which is correct?

Yes I see. It's because MAP_SHARED | MAP_ANONYMOUS isn't *really*
anonymous from the virtual memory subsystem's point of view. But that
just serves to confuse userspace I guess.

Anything with MAP_SHARED, shmat, or MADV_DONTFORK is OK.

Anything else (MAP_PRIVATE, brk, without MADV_DONTFORK) is
dangerous. These type are used by standard heap allocators malloc,
new, etc.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-09  7:01                                 ` Nick Piggin
@ 2012-05-09  7:18                                   ` Michael Kerrisk (man-pages)
  2012-05-10 15:00                                     ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Michael Kerrisk (man-pages) @ 2012-05-09  7:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Jan Kara, Jeff Moyer, LKML, linux-man, linux-mm,
	mgorman, Andrea Arcangeli, Woodman

On Wed, May 9, 2012 at 7:01 PM, Nick Piggin <npiggin@gmail.com> wrote:
> On 9 May 2012 15:35, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
>> On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <npiggin@gmail.com> wrote:
>>> On 6 May 2012 01:29, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
>>>>> So, am I correct to assume that right text to add to the page is as below?
>>>>>
>>>>> Nick, can you clarify what you mean by "quiesced"?
>>>>
>>>> finished?
>>>
>>> Yes exactly. That might be a simpler word. Thanks!
>>
>> Thanks.
>>
>> But see below. I realize the text is still ambiguous.
>>
>>>>> [[
>>>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>>>>> when the memory buffer is anonymous memory, or comes from mmap(2)
>>>>> with MAP_PRIVATE.
>>>>>
>>>>> Any such IOs, whether submitted with asynchronous IO interface or from
>>>>> another thread in the process, should be quiesced before fork(2) is called.
>>>>> Failure to do so can result in data corruption and undefined behavior in
>>>>> parent and child processes.
>>>>>
>>>>> This restriction does not apply when the memory buffer for the O_DIRECT
>>>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>>>>> Nor does this restriction apply when the memory buffer has been advised
>>>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>>>>> to the child after fork(2).
>>>>> ]]
>>
>> In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
>> unclear. The first paragraph implies that such a buffer is unsafe,
>> while the third paragraph implies that it *is* safe, thus
>> contradicting the first paragraph. Which is correct?
>
> Yes I see. It's because MAP_SHARED | MAP_ANONYMOUS isn't *really*
> anonymous from the virtual memory subsystem's point of view. But that
> just serves to confuse userspace I guess.
>
> Anything with MAP_SHARED, shmat, or MADV_DONTFORK is OK.
>
> Anything else (MAP_PRIVATE, brk, without MADV_DONTFORK) is
> dangerous. These type are used by standard heap allocators malloc,
> new, etc.

So, would the following text be okay:

       O_DIRECT I/Os should never be run concurrently with the fork(2)
       system call, if the memory buffer is a private  mapping  (i.e.,
       any  mapping  created  with  the mmap(2) MAP_PRIVATE flag; this
       includes memory allocated on the heap and statically  allocated
       buffers).  Any such I/Os, whether submitted via an asynchronous
       I/O interface or from another thread in the process, should  be
       completed  before  fork(2)  is  called.   Failure  to do so can
       result in data corruption and undefined behavior in parent  and
       child processes.  This restriction does not apply when the mem‐
       ory buffer for the O_DIRECT I/Os was created using shmat(2)  or
       mmap(2)  with  the  MAP_SHARED flag.  Nor does this restriction
       apply when the memory buffer has been advised as  MADV_DONTFORK
       with  madvise(2), ensuring that it will not be available to the
       child after fork(2).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Describe race of direct read and fork for unaligned buffers
  2012-05-09  7:18                                   ` Michael Kerrisk (man-pages)
@ 2012-05-10 15:00                                     ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2012-05-10 15:00 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Nick Piggin, KOSAKI Motohiro, Jan Kara, Jeff Moyer, LKML,
	linux-man, linux-mm, mgorman, Andrea Arcangeli, Woodman

On Wed 09-05-12 19:18:16, Michael Kerrisk (man-pages) wrote:
> On Wed, May 9, 2012 at 7:01 PM, Nick Piggin <npiggin@gmail.com> wrote:
> > On 9 May 2012 15:35, Michael Kerrisk (man-pages) <mtk.manpages@gmail.com> wrote:
> >> On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <npiggin@gmail.com> wrote:
> >>> On 6 May 2012 01:29, KOSAKI Motohiro <kosaki.motohiro@gmail.com> wrote:
> >>>>> So, am I correct to assume that right text to add to the page is as below?
> >>>>>
> >>>>> Nick, can you clarify what you mean by "quiesced"?
> >>>>
> >>>> finished?
> >>>
> >>> Yes exactly. That might be a simpler word. Thanks!
> >>
> >> Thanks.
> >>
> >> But see below. I realize the text is still ambiguous.
> >>
> >>>>> [[
> >>>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
> >>>>> when the memory buffer is anonymous memory, or comes from mmap(2)
> >>>>> with MAP_PRIVATE.
> >>>>>
> >>>>> Any such IOs, whether submitted with asynchronous IO interface or from
> >>>>> another thread in the process, should be quiesced before fork(2) is called.
> >>>>> Failure to do so can result in data corruption and undefined behavior in
> >>>>> parent and child processes.
> >>>>>
> >>>>> This restriction does not apply when the memory buffer for the O_DIRECT
> >>>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
> >>>>> Nor does this restriction apply when the memory buffer has been advised
> >>>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> >>>>> to the child after fork(2).
> >>>>> ]]
> >>
> >> In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
> >> unclear. The first paragraph implies that such a buffer is unsafe,
> >> while the third paragraph implies that it *is* safe, thus
> >> contradicting the first paragraph. Which is correct?
> >
> > Yes I see. It's because MAP_SHARED | MAP_ANONYMOUS isn't *really*
> > anonymous from the virtual memory subsystem's point of view. But that
> > just serves to confuse userspace I guess.
> >
> > Anything with MAP_SHARED, shmat, or MADV_DONTFORK is OK.
> >
> > Anything else (MAP_PRIVATE, brk, without MADV_DONTFORK) is
> > dangerous. These type are used by standard heap allocators malloc,
> > new, etc.
> 
> So, would the following text be okay:
> 
>        O_DIRECT I/Os should never be run concurrently with the fork(2)
>        system call, if the memory buffer is a private  mapping  (i.e.,
>        any  mapping  created  with  the mmap(2) MAP_PRIVATE flag; this
>        includes memory allocated on the heap and statically  allocated
>        buffers).  Any such I/Os, whether submitted via an asynchronous
>        I/O interface or from another thread in the process, should  be
>        completed  before  fork(2)  is  called.   Failure  to do so can
>        result in data corruption and undefined behavior in parent  and
>        child processes.  This restriction does not apply when the mem‐
>        ory buffer for the O_DIRECT I/Os was created using shmat(2)  or
>        mmap(2)  with  the  MAP_SHARED flag.  Nor does this restriction
>        apply when the memory buffer has been advised as  MADV_DONTFORK
>        with  madvise(2), ensuring that it will not be available to the
>        child after fork(2).
  This text looks OK, to me. Thanks for putting it together.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-05-10 15:01 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-30  9:30 [PATCH] Describe race of direct read and fork for unaligned buffers Jan Kara
2012-04-30 13:41 ` Jeff Moyer
2012-04-30 14:30 ` Mel Gorman
2012-05-01  5:50 ` Michael Kerrisk (man-pages)
2012-05-01  6:49 ` Nick Piggin
2012-05-01 14:31 ` KOSAKI Motohiro
2012-05-01 14:37   ` KOSAKI Motohiro
2012-05-01 15:11     ` Jeff Moyer
2012-05-01 15:34       ` KOSAKI Motohiro
2012-05-01 15:38         ` Jeff Moyer
2012-05-01 15:50           ` Nick Piggin
2012-05-01 23:51             ` Andrea Arcangeli
2012-05-02  8:17             ` Jan Kara
2012-05-02  9:09               ` Nick Piggin
2012-05-02  9:18                 ` Jan Kara
2012-05-02 19:14                   ` KOSAKI Motohiro
2012-05-02 19:23                     ` Jan Kara
2012-05-02 19:25                       ` KOSAKI Motohiro
2012-05-05 11:28                         ` Michael Kerrisk (man-pages)
2012-05-05 15:29                           ` KOSAKI Motohiro
2012-05-08 23:10                             ` Nick Piggin
2012-05-09  5:35                               ` Michael Kerrisk (man-pages)
2012-05-09  7:01                                 ` Nick Piggin
2012-05-09  7:18                                   ` Michael Kerrisk (man-pages)
2012-05-10 15:00                                     ` Jan Kara
2012-05-01 16:15 ` KOSAKI Motohiro
2012-05-01 17:56   ` Michael Kerrisk (man-pages)
2012-05-02  0:34     ` Nick Piggin
2012-05-02  3:04       ` Hugh Dickins
2012-05-02  3:10         ` Nick Piggin
2012-05-02  9:20         ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).