Subject: Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping
From: Kirill Tkhai
To: Andy Lutomirski
Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
 "Kirill A. Shutemov", alexander.h.duyck@linux.intel.com, Weiny Ira,
 Andrey Konovalov, arunks@codeaurora.org, Vlastimil Babka,
 Christoph Lameter, Rik van Riel, Kees Cook, Johannes Weiner,
 Nicholas Piggin, Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
 Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
 daniel.m.jordan@oracle.com, Jann Horn, Adam Borowski, Linux API,
 LKML, Linux-MM
References: <155836064844.2441.10911127801797083064.stgit@localhost.localdomain>
 <9638a51c-4295-924f-1852-1783c7f3e82d@virtuozzo.com>
Message-ID: <6483b75c-9725-126e-6fb3-ce05fb703a87@virtuozzo.com>
In-Reply-To: <9638a51c-4295-924f-1852-1783c7f3e82d@virtuozzo.com>
Date: Tue, 21 May 2019 18:59:29 +0300

On 21.05.2019 18:52, Kirill Tkhai wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai wrote:
>>>
>>> [Summary]
>>>
>>> A new syscall, which allows cloning a remote process's VMA into the
>>> local process's VM. The remote process's page table entries related
>>> to the VMA are cloned into the local process's page table (at any
>>> desired address, which makes this different from what happens
>>> during fork()). Huge pages are handled appropriately.
>>>
>>> This allows a significant performance improvement, as shown in the
>>> example below.
>>>
>>> [Description]
>>>
>>> This patchset adds a new syscall, which makes it possible to clone
>>> a VMA from another process into the current process. The syscall
>>> supplements the functionality provided by the process_vm_writev()
>>> and process_vm_readv() syscalls, and it may be useful in many
>>> situations.
>>>
>>> For example, it allows making a zero-copy transfer of data where
>>> process_vm_writev() was previously used:
>>>
>>>     struct iovec local_iov, remote_iov;
>>>     void *buf;
>>>
>>>     buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>     recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>>     local_iov.iov_base = buf;
>>>     local_iov.iov_len  = n * PAGE_SIZE;
>>>     remote_iov = ...;
>>>
>>>     process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
>>>     munmap(buf, n * PAGE_SIZE);
>>>
>>> (Note that the above completely ignores error handling.)
>>>
>>> There are several problems with process_vm_writev() in this example:
>>>
>>> 1) it causes page faults on the remote process's memory, and it
>>>    forces allocation of new pages (if they were not preallocated);
>>
>> I don't see how your new syscall helps. You're writing to remote
>> memory. If that memory wasn't allocated, it's going to get allocated
>> regardless of whether you use a write-like interface or an mmap-like
>> interface.
>
> No, this is not just another interface for copying memory. The point
> is to borrow a remote task's VMA and the corresponding page table
> contents. The syscall allows copying part of a page table, with its
> preallocated pages, from a remote process to a local one. See here:
>
> [task1]                      [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE,
>            PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
>                              buf = process_vm_mmap(pid_of_task1, addr,
>                                                    n * PAGE_SIZE, ...);
> munmap(buf, n * PAGE_SIZE);
>
> process_vm_mmap() copies the PTEs related to the memory of buf in
> task1 to task2, just as is done during the fork() syscall.
>
> There is no copying of buf's memory content unless COW happens. This
> is the principal difference from process_vm_writev(), which just
> allocates pages in the remote VM.
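To make the intended usage concrete, here is a minimal userspace
sketch. It is an illustration only: the syscall is not merged, so
__NR_process_vm_mmap below is a placeholder number, and the wrapper
just mirrors the five-argument form used in the examples in this
thread (remote pid, remote address, length, local address hint,
flags):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder: no syscall number is allocated for the patchset,
     * so define one matching the number used in a patched kernel. */
    #ifndef __NR_process_vm_mmap
    #define __NR_process_vm_mmap 1000
    #endif

    /* Hypothetical thin wrapper over the proposed syscall. */
    static void *process_vm_mmap(pid_t pid, void *remote_addr,
                                 size_t len, void *local_addr,
                                 unsigned long flags)
    {
            return (void *)syscall(__NR_process_vm_mmap, pid,
                                   remote_addr, len, local_addr, flags);
    }

    /* In task2: clone the n-page mapping at remote_buf in task1 into
     * our address space. The kernel copies page table entries, not
     * page contents. */
    static void *attach_remote_buf(pid_t pid_of_task1, void *remote_buf,
                                   size_t n)
    {
            void *buf = process_vm_mmap(pid_of_task1, remote_buf,
                                        n * sysconf(_SC_PAGESIZE),
                                        NULL, 0);

            if (buf == MAP_FAILED)
                    perror("process_vm_mmap");
            return buf;
    }

The pid of task1 and the address of its buffer have to be passed out
of band (e.g. over the same socket), just like the remote address for
process_vm_writev().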
>> Keep in mind that, on x86, just the hardware part of a page fault
>> is very slow -- populating the memory with a syscall instead of a
>> fault may well be faster.
>
> It is not as slow as disk I/O is. Just compare what happens when the
> anonymous pages backing task1's buf have been swapped out:
>
> 1) process_vm_writev() reads them back into memory;
>
> 2) process_vm_mmap() just copies the swap PTEs from task1's page
>    table to task2's page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, page faults are a minor concern next to the disk I/O
> problems I have shown.
>
>>> 2) the amount of memory for this example is doubled for a moment --
>>>    n pages in the current task and n pages in the remote task are
>>>    occupied at the same time;
>>
>> This seems disingenuous. If you're writing p pages total in chunks
>> of n pages, you will use a total of p pages if you use mmap and p+n
>> if you use write.
>
> I didn't understand this sentence because of the many ifs, sorry.
> Could you please explain your thought once again?
>
>> That only doubles the amount of memory if you let n scale linearly
>> with p, which seems unlikely.
>>
>>> 3) the received data has no chance of being properly swapped out
>>>    for a long time.
>>
>> ...
>>
>>> a) the kernel moves @buf pages into swap right after recv();
>>> b) process_vm_writev() reads the data back from swap to pages;
>>
>> If you're under that much memory pressure and thrashing that badly,
>> your performance is going to be awful no matter what you're doing.
>> If you indeed observe this behavior under normal loads, then this
>> seems like a VM issue that should be addressed in its own right.
>
> I don't think so. Imagine: a container migrates from one node to
> another. The nodes are identical; say, each of them has 4GB of RAM.
>
> Before the migration, the container's tasks used 4GB of RAM and 8GB
> of swap. After the page server on the second node has received the
> pages, we want these pages to be swapped out as soon as possible,
> and we don't want to read them back from swap to pass a read
> consumer.

Should be "to pass a *real* consumer".

> The page server is task1 in the example. The real consumer is task2.
>
> This is a rather normal load, I think.
>
>>> [Task 1]
>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> [Task 2]
>>> buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>>
>>> This creates a copy of the VMA backing buf in task1 in task2's VM.
>>> Task1's page table entries are copied into the corresponding page
>>> table entries of task2's VM.
>>
>> You need to fully explain a whole bunch of details that you've
>> ignored.
>
> Yeah, that's not a problem :) I'm ready to explain and describe
> everything that may raise a question. Just ask ;)
>
>> For example, if the remote VMA is MAP_ANONYMOUS, do you get a CoW
>> copy of it? I assume you don't since the whole point is to write to
>> remote memory,
>
> But no, there *are* COW semantics. We do not copy memory. We copy
> page table contents. This is just the same as what happens on
> fork(), when a child duplicates its parent's VMA and the related
> page table subset, and the parent's PTEs lose the _PAGE_RW flag.
>
> The copy_page_range() code is all reused for that. Please see [3/7]
> for the details.
>
> I expect a particular performance gain when using THP, since the
> number of entries to copy is smaller than in the PTE case.
>
> Copying several PMDs from one task's page table to another's is
> much, much faster than process_vm_writev() copying the pages (not
> even to mention reading them back from swap).
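And a sketch of the COW semantics just described, reusing the
hypothetical attach_remote_buf() wrapper from the sketch above
(buf1_addr stands for the address of task1's buffer, learned out of
band):

    /* In task2, after task1 has filled its n-page buffer: */
    char *buf2 = attach_remote_buf(pid_of_task1, buf1_addr, n);

    /* Both tasks now reference the same physical pages through
     * write-protected PTEs, exactly like parent and child after
     * fork(). */

    buf2[0] = 'x';  /* COW fault: task2 gets a private copy of this
                     * page; task1's page and data stay untouched */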
>> but it's at the very least quite unusual in Linux to have two
>> different anonymous VMAs such that writing one of them changes the
>> other one.
>
> Writing to the new VMA does not affect the old VMA. The old VMA is
> just used to take vma->anon_vma and vma->vm_file from. The two VMAs
> remain independent of each other.
>
>> But there are plenty of other questions. What happens if the remote
>> VMA is a gate area or other special mapping (vDSO, vvar area, etc)?
>> What if the remote memory comes from a driver that wasn't expecting
>> the mapping to get magically copied to a different process?
>
> If someone wants to duplicate such mappings, we may consider that
> and extend the interface in the future for the VMA types which are
> safe for it.
>
> But for now the logic is very overprotective, and all the unusual
> mappings like the ones you mentioned (also AIO, etc.) are
> prohibited. Please see [7/7] for the details.
>
>> This new API seems quite dangerous and complex to me, and I don't
>> think the value has been adequately demonstrated.
>
> I don't think it's dangerous or complex, because I haven't
> introduced any VMA concepts fundamentally different from what we
> have now. We just borrow vma->anon_vma and vma->vm_file from the
> remote process into the local one, like we do on fork() (borrowing
> vma->anon_vma means not blindly copying it, but an ordinary
> anon_vma_fork()).
>
> Maybe I should have focused the description more on the copying of
> PTEs/PMDs instead of on VMA duplication; it was unexpected to me
> that people think of simple memory copying after reading the example
> I gave. But I have given more explanation here, so I hope the
> situation has become clearer for a reader. Anyway, if you have any
> questions, please ask me.
>
> Thanks,
> Kirill