From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: Jann Horn <jannh@google.com>, Hugh Dickins <hughd@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Rik van Riel <riel@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	sqazi@google.com, "Michael S. Tsirkin" <mst@redhat.com>,
	jack@suse.cz, kernel list <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [BUG] mm: direct I/O (using GUP) can write to COW anonymous pages
Date: Tue, 18 Sep 2018 12:13:56 +0300
Message-ID: <5f794be8-f6f1-57f1-cb61-43e34cd6c4ed@yandex-team.ru>
In-Reply-To: <CAG48ez1hk5evqQpyvticPzLFOcESfo2NoWnqrLZk6N4PXwdsOw@mail.gmail.com>

On 18.09.2018 03:35, Jann Horn wrote:
> On Tue, Sep 18, 2018 at 2:05 AM Hugh Dickins <hughd@google.com> wrote:
>>
>> Hi Jann,
>>
>> On Mon, 17 Sep 2018, Jann Horn wrote:
>>
>>> [I'm not sure who the best people to ask about this are, I hope the
>>> recipient list resembles something reasonable...]
>>>
>>> I have noticed that the dup_mmap() logic on fork() doesn't handle
>>> pages with active direct I/O properly: dup_mmap() seems to assume that
>>> making the PTE referencing a page readonly will always prevent future
>>> writes to the page, but if the kernel has acquired a direct reference
>>> to the page before (e.g. via get_user_pages_fast()), writes can still
>>> happen that way.
>>>
>>> The worst-case effect of this - as far as I can tell - is that when a
>>> multithreaded process forks while one thread is in the middle of
>>> sys_read() on a file that uses direct I/O with get_user_pages_fast(),
>>> the read data can become visible in the child while the parent's
>>> buffer stays uninitialized if the parent writes to a relevant page
>>> post-fork before either the I/O completes or the child writes to it.
>>
>> Yes: you're understandably more worried by the one seeing the other's
>> data;
>
> Actually, I was mostly just trying to find a scenario in which the
> parent doesn't get the data it's asking for, and this is the simplest
> I could come up with. :)
This might happen even in a single-threaded process:

  io_submit()
  fork()
  <CoW of the same page>
  io_getevents()
  <no new data in page>

For example, if the buffer is only 512 bytes and shares a page with something else.
THP widens the race window, since much more data shares one huge page.
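
The sequence above could be sketched roughly as follows. This is only an
illustration under several assumptions: it uses Linux native AIO through raw
syscalls, a 512-byte O_DIRECT read, and 4 KiB pages. The helper name
aio_fork_sequence() and its return convention are made up for this example,
and the child is kept alive until the I/O completes so that the page stays
shared when the parent writes to it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: io_submit() -> fork() -> CoW -> io_getevents().
 * `path` must name a file whose first byte is not 1.
 * Returns 1 if the parent's buffer still holds the old contents,
 * 0 if the read data arrived, -1 if the sequence could not be run.
 */
int aio_fork_sequence(const char *path)
{
    assert(path != NULL);

    char *page = NULL;
    if (posix_memalign((void **)&page, 4096, 4096))
        return -1;
    memset(page, 1, 4096);          /* dirty the page; the I/O buffer is page[0..511] */

    int result = -1;
    aio_context_t ctx = 0;
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd == -1)
        goto out_free;
    if (syscall(SYS_io_setup, 1, &ctx) == -1)
        goto out_close;

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = fd;
    cb.aio_buf = (uint64_t)(uintptr_t)page;
    cb.aio_nbytes = 512;
    struct iocb *list[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, list) != 1)
        goto out_destroy;

    pid_t child = fork();           /* the page is now shared copy-on-write */
    if (child == 0) {
        pause();                    /* stay alive so the page remains shared */
        _exit(0);
    }
    page[0x800] = 2;                /* write next to the buffer: may CoW the page */

    struct io_event ev;
    long got = syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);
    if (child > 0) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
    }
    if (got == 1 && ev.res == 512)
        result = (page[0] == 1);    /* 1: no new data in the parent's page */

out_destroy:
    syscall(SYS_io_destroy, ctx);
out_close:
    close(fd);
out_free:
    free(page);
    return result;
}
```

Whether this returns 1 depends on timing (the read may finish before the
fork) and on the kernel version, so treat it as a description of the
sequence rather than a reliable reproducer.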
>
> I was also vaguely worried about whether some other part of the mm
> subsystem might assume that COW pages are immutable, but I haven't
> found anything like that so far, so that might've been unwarranted
> paranoia.
>
>> we've tended in the past to be more worried about the one getting
>> corruption, and the other not seeing the data it asked for (and usually
>> in the context of RDMA, rather than filesystem direct I/O).
>>
>> I've added some Cc's: I might be misremembering, but I think both
>> Andrea and Konstantin have offered approaches to this in the past,
>> and I believe Salman is taking a look at it currently.
>>
>> But my own interest ended when Michael added MADV_DONTFORK: beyond
>> that, we've rated it a "Patient: It hurts when I do this. Doctor:
>> Don't do that then" - more complexity and overhead to solve, than
>> we have had appetite to get into.
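
For reference, a minimal sketch of what MADV_DONTFORK does to an I/O buffer,
assuming an anonymous private mapping. The helper name
dontfork_hides_buffer_from_child() is made up for this example. The child
touches the buffer and is expected to die with SIGSEGV, because the region is
simply not mapped there.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the child did not inherit the buffer
 * (it faulted on access), 0 if it did, -1 if the setup failed.
 */
int dontfork_hides_buffer_from_child(size_t len)
{
    assert(len > 0);

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 1, len);            /* make the pages present and dirty */

    int result = -1;
    int status;
    if (madvise(buf, len, MADV_DONTFORK) == 0) {
        pid_t child = fork();
        if (child == 0) {
            struct rlimit no_core = { 0, 0 };
            setrlimit(RLIMIT_CORE, &no_core);   /* avoid leaving a core file */
            volatile char c = buf[0];           /* expected to fault */
            (void)c;
            _exit(0);
        }
        if (child > 0 && waitpid(child, &status, 0) == child)
            result = WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
    }
    munmap(buf, len);
    return result;
}
```

Since the buffer's pages are never shared with the child, a later write by
the parent does not trigger CoW on them, and in-flight I/O lands in the page
the parent keeps.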
>
> Makes sense, I guess.
>
> I wonder whether there's a concise way to express this in the fork.2
> manpage, or something like that. Maybe I'll take a stab at writing
> something. The biggest issue I see with documenting this edgecase is
> that, as an application developer, if you don't know whether some file
> might be coming from a FUSE filesystem that has opted out of using the
> disk cache, the "don't do that" essentially becomes "don't read() into
> heap buffers while fork()ing in another thread", since with FUSE,
> direct I/O can happen even if you don't open files as O_DIRECT as long
> as the filesystem requests direct I/O, and get_user_pages_fast() will
> AFAIU be used for non-page-aligned buffers, meaning that an adjacent
> heap memory access could trigger CoW page duplication. But then, FUSE
> filesystems that opt out of the disk cache are probably so rare that
> it's not a concern in practice...
fork() from a multithreaded process is almost always broken,
because the child inherits locks held by other threads.
For example, glibc's localtime_r() is unsafe after such a fork().
Unless the child calls execve() right away, it cannot do much.
But in that case vfork() works much better and faster.
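
A small sketch of both points, with made-up helper names. The first function
forks while another thread holds a plain pthread mutex and lets the child
report whether it sees the mutex as locked; calling pthread_mutex_trylock()
in the child is itself outside what POSIX guarantees there, so it is used
only as a probe. The second function shows the vfork() plus immediate
execve() pattern.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t locked, release;

static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    sem_post(&locked);              /* tell the main thread the lock is held */
    sem_wait(&release);             /* keep holding it across the fork */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Returns 1 if the child found the mutex locked, 0 if not, -1 on failure. */
int child_inherits_held_lock(void)
{
    pthread_t t;
    sem_init(&locked, 0, 0);
    sem_init(&release, 0, 0);
    int rc = pthread_create(&t, NULL, holder, NULL);
    assert(rc == 0);
    sem_wait(&locked);

    int seen = -1;
    int status;
    pid_t child = fork();           /* only the calling thread exists in the child */
    if (child == 0)
        _exit(pthread_mutex_trylock(&lock) == EBUSY);
    if (child > 0 && waitpid(child, &status, 0) == child && WIFEXITED(status))
        seen = WEXITSTATUS(status);

    sem_post(&release);
    pthread_join(t, NULL);
    return seen;
}

/* vfork() and exec right away; returns the exit status or -1 on failure. */
int run_via_vfork(char *const argv[])
{
    pid_t child = vfork();
    if (child == 0) {
        execve(argv[0], argv, environ);
        _exit(127);                 /* only reached if execve() failed */
    }
    int status;
    if (child == -1 || waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In the first case nobody in the child will ever unlock the mutex, so a
blocking pthread_mutex_lock() there would hang forever. In the second case
the child does nothing but exec, so no inherited lock state is ever touched.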
>
>> But not a shiningly satisfactory
>> situation, I'll agree.
>>
>> Hugh
>>
>>>
>>> Reproducer code:
>>>
>>> ====== START hello.c ======
>>> #define FUSE_USE_VERSION 26
>>>
>>> #include <fuse.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <errno.h>
>>> #include <fcntl.h>
>>> #include <unistd.h>
>>> #include <err.h>
>>> #include <sys/uio.h>
>>>
>>> static const char *hello_path = "/hello";
>>>
>>> static int hello_getattr(const char *path, struct stat *stbuf)
>>> {
>>>   int res = 0;
>>>   memset(stbuf, 0, sizeof(struct stat));
>>>   if (strcmp(path, "/") == 0) {
>>>     stbuf->st_mode = S_IFDIR | 0755;
>>>     stbuf->st_nlink = 2;
>>>   } else if (strcmp(path, hello_path) == 0) {
>>>     stbuf->st_mode = S_IFREG | 0666;
>>>     stbuf->st_nlink = 1;
>>>     stbuf->st_size = 0x1000;
>>>     stbuf->st_blocks = 0;
>>>   } else
>>>     res = -ENOENT;
>>>   return res;
>>> }
>>>
>>> static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t
>>> filler, off_t offset, struct fuse_file_info *fi) {
>>>   filler(buf, ".", NULL, 0);
>>>   filler(buf, "..", NULL, 0);
>>>   filler(buf, hello_path + 1, NULL, 0);
>>>   return 0;
>>> }
>>>
>>> static int hello_open(const char *path, struct fuse_file_info *fi) {
>>>   return 0;
>>> }
>>>
>>> static int hello_read(const char *path, char *buf, size_t size, off_t
>>> offset, struct fuse_file_info *fi) {
>>>   sleep(3);
>>>   size_t len = 0x1000;
>>>   if (offset < len) {
>>>     if (offset + size > len)
>>>       size = len - offset;
>>>     memset(buf, 0, size);
>>>   } else
>>>     size = 0;
>>>   return size;
>>> }
>>>
>>> static int hello_write(const char *path, const char *buf, size_t size,
>>> off_t offset, struct fuse_file_info *fi) {
>>>   while(1) pause();
>>> }
>>>
>>> static struct fuse_operations hello_oper = {
>>>   .getattr = hello_getattr,
>>>   .readdir = hello_readdir,
>>>   .open = hello_open,
>>>   .read = hello_read,
>>>   .write = hello_write,
>>> };
>>>
>>> int main(int argc, char *argv[]) {
>>>   return fuse_main(argc, argv, &hello_oper, NULL);
>>> }
>>> ====== END hello.c ======
>>>
>>> ====== START simple_mmap.c ======
>>> #define _GNU_SOURCE
>>> #include <pthread.h>
>>> #include <sys/mman.h>
>>> #include <err.h>
>>> #include <unistd.h>
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <signal.h>
>>> #include <sys/prctl.h>
>>> #include <sys/wait.h>
>>>
>>> __attribute__((aligned(0x1000))) char data_buffer_[0x10000];
>>> #define data_buffer (data_buffer_ + 0x8000)
>>>
>>> void *fuse_thread(void *dummy) {
>>>   /* step 2: start direct I/O on data_buffer */
>>>   int fuse_fd = open("mount/hello", O_RDWR);
>>>   if (fuse_fd == -1)
>>>     err(1, "unable to open FUSE fd");
>>>   printf("char in parent (before): %hhd\n", data_buffer[0]);
>>>   int res = read(fuse_fd, data_buffer, 0x1000);
>>>   /* step 6: read completes, show post-read state */
>>>   printf("fuse read result: %d\n", res);
>>>   printf("char in parent (after): %hhd\n", data_buffer[0]);
>>> }
>>>
>>> int main(void) {
>>>   /* step 1: make data_buffer dirty */
>>>   data_buffer[0] = 1;
>>>
>>>   pthread_t thread;
>>>   if (pthread_create(&thread, NULL, fuse_thread, NULL))
>>>     errx(1, "pthread_create");
>>>
>>>   sleep(1);
>>>   /* step 3: fork a child */
>>>   pid_t child = fork();
>>>   if (child == -1)
>>>     err(1, "fork");
>>>   if (child == 0) {
>>>     prctl(PR_SET_PDEATHSIG, SIGKILL);
>>>     sleep(1);
>>>
>>>     /* step 5: show pre-read state in the child */
>>>     printf("char in child (before): %hhd\n", data_buffer[0]);
>>>     sleep(3);
>>>     /* step 7: read is complete, show post-read state in child */
>>>     printf("char in child (after): %hhd\n", data_buffer[0]);
>>>     return 0;
>>>   }
>>>
>>>   /* step 4: de-CoW data_buffer in the parent */
>>>   data_buffer[0x800] = 2;
>>>
>>>   int status;
>>>   if (wait(&status) != child)
>>>     err(1, "wait");
>>> }
>>> ====== END simple_mmap.c ======
>>>
>>> Repro steps:
>>>
>>> In one terminal:
>>> $ mkdir mount
>>> $ gcc -o hello hello.c -Wall -std=gnu99 `pkg-config fuse --cflags --libs`
>>> hello.c: In function ‘hello_write’:
>>> hello.c:67:1: warning: no return statement in function returning
>>> non-void [-Wreturn-type]
>>> }
>>> ^
>>> $ ./hello -d -o direct_io mount
>>> FUSE library version: 2.9.7
>>> [...]
>>>
>>> In a second terminal:
>>> $ gcc -pthread -o simple_mmap simple_mmap.c
>>> $ ./simple_mmap
>>> char in parent (before): 1
>>> char in child (before): 1
>>> fuse read result: 4096
>>> char in parent (after): 1
>>> char in child (after): 0
>>>
>>> I have tested that this still works on 4.19.0-rc3+.
>>>
>>> As far as I can tell, the fix would be to immediately copy pages with
>>> `refcount - mapcount > N` in dup_mmap(), or something like that?
>>>
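
The check suggested above can be written down as a toy model. This is not
kernel code: struct page_model, should_copy_at_fork() and the `slack`
parameter (the N above) are invented for illustration, and a real page has
other legitimate references that would have to be counted into the slack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two counters the heuristic looks at. */
struct page_model {
    int refcount;   /* all references to the page */
    int mapcount;   /* references that come from page table mappings */
};

/*
 * True if fork should copy the page immediately instead of sharing it
 * copy-on-write, because it has more than `slack` references that are
 * not explained by mappings (for example a GUP reference held for I/O).
 */
bool should_copy_at_fork(const struct page_model *p, int slack)
{
    assert(p != NULL && slack >= 0);
    return p->refcount - p->mapcount > slack;
}
```

With slack 0, a page mapped once and referenced once is shared as usual,
while the same page with one additional reference is copied at fork time.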