From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Kravetz
Subject: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag
Date: Thu, 6 Jul 2017 09:17:25 -0700
Message-Id: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov", Mike Kravetz

The mremap system call has the ability to 'mirror' parts of an existing
mapping. To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address. This
functionality has existed since at least the 2.6 kernel [1]. A comment
was added to the code to help preserve this feature.

The Oracle JVM team has discovered this feature and used it while
prototyping a new garbage collection model. This new model shows promise,
and they are considering its use in a future release. However, since
the only mention of this functionality is a single comment in the kernel,
they are concerned about its future.

I propose the addition of a new MREMAP_MIRROR flag to explicitly request
this functionality. The flag simply provides the same functionality as
the existing undocumented 'old_size == 0' interface. As an alternative,
we could simply document the 'old_size == 0' interface in the man page.
In either case, man page modifications would be needed.

Future Direction

After more formally adding this to the API (either new flag or documenting
existing interface), the mremap code could be enhanced to optimize this
case. Currently, 'mirroring' only sets up the new mapping. It does not
create page table entries for new mapping. This could be added as an
enhancement.

The JVM today has the option of using (static) huge pages. The mremap
system call does not fully support huge page mappings today. You can
use mremap to shrink the size of a huge page mapping, but it cannot be
used to expand or mirror a mapping. Such support is fairly
straightforward.

[1] https://lkml.org/lkml/2004/1/12/260

Mike Kravetz (1):
  mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

 include/uapi/linux/mman.h       |  5 +++--
 mm/mremap.c                     | 23 ++++++++++++++++-------
 tools/include/uapi/linux/mman.h |  5 +++--
 3 files changed, 22 insertions(+), 11 deletions(-)

--
2.7.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Kravetz
Subject: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality
Date: Thu, 6 Jul 2017 09:17:26 -0700
Message-Id: <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com>
In-Reply-To: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov", Mike Kravetz

The mremap system call has the ability to 'mirror' parts of an existing
mapping. To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address. This
functionality has existed since at least the 2.6 kernel.

This patch simply adds a new flag to mremap which will make this
functionality part of the API. It maintains backward compatibility with
the existing way of requesting mirroring (old_size == 0).

If this new MREMAP_MIRROR flag is specified, then new_size must equal
old_size. In addition, the MREMAP_MAYMOVE flag must be specified.

Signed-off-by: Mike Kravetz
---
 include/uapi/linux/mman.h       |  5 +++--
 mm/mremap.c                     | 23 ++++++++++++++++-------
 tools/include/uapi/linux/mman.h |  5 +++--
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..6b3e0df 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@
 
 #include
 
-#define MREMAP_MAYMOVE	1
-#define MREMAP_FIXED	2
+#define MREMAP_MAYMOVE	0x01
+#define MREMAP_FIXED	0x02
+#define MREMAP_MIRROR	0x04
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b1..f18ab36 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
 	LIST_HEAD(uf_unmap);
 
-	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
+	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
 		return ret;
 
-	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
+	if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
+	    !(flags & MREMAP_MAYMOVE))
 		return ret;
 
 	if (offset_in_page(addr))
@@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	old_len = PAGE_ALIGN(old_len);
 	new_len = PAGE_ALIGN(new_len);
 
-	/*
-	 * We allow a zero old-len as a special case
-	 * for DOS-emu "duplicate shm area" thing. But
-	 * a zero new-len is nonsensical.
-	 */
+	/* A zero new-len is nonsensical. */
 	if (!new_len)
 		return ret;
 
+	/*
+	 * For backward compatibility, we allow a zero old-len to imply
+	 * mirroring. This was originally a special case for DOS-emu.
+	 */
+	if (!old_len)
+		flags |= MREMAP_MIRROR;
+	else if (flags & MREMAP_MIRROR) {
+		if (old_len != new_len)
+			return ret;
+		old_len = 0;
+	}
+
 	if (down_write_killable(&current->mm->mmap_sem))
 		return -EINTR;
 
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index 81d8edf..069f7a5 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@
 
 #include
 
-#define MREMAP_MAYMOVE	1
-#define MREMAP_FIXED	2
+#define MREMAP_MAYMOVE	0x01
+#define MREMAP_FIXED	0x02
+#define MREMAP_MIRROR	0x04
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
--
2.7.5

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
From: Anshuman Khandual
Date: Fri, 7 Jul 2017 13:49:46 +0530
In-Reply-To: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
Message-Id: <6f1460ef-a896-aef4-c0dc-66227232e025@linux.vnet.ibm.com>
To: Mike Kravetz, linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel [1]. A comment
> was added to the code to help preserve this feature.

Is this the comment? If yes, then it's not very clear.

	/*
	 * We allow a zero old-len as a special case
	 * for DOS-emu "duplicate shm area" thing. But
	 * a zero new-len is nonsensical.
	 */

>
> The Oracle JVM team has discovered this feature and used it while
> prototyping a new garbage collection model. This new model shows promise,
> and they are considering its use in a future release. However, since
> the only mention of this functionality is a single comment in the kernel,
> they are concerned about its future.
>
> I propose the addition of a new MREMAP_MIRROR flag to explicitly request
> this functionality. The flag simply provides the same functionality as
> the existing undocumented 'old_size == 0' interface. As an alternative,
> we could simply document the 'old_size == 0' interface in the man page.
> In either case, man page modifications would be needed.

Right. Adding MREMAP_MIRROR sounds cleaner from application programming
point of view. But it extends the interface.

>
> Future Direction
>
> After more formally adding this to the API (either new flag or documenting
> existing interface), the mremap code could be enhanced to optimize this
> case. Currently, 'mirroring' only sets up the new mapping. It does not
> create page table entries for new mapping. This could be added as an
> enhancement.

Then how it achieves mirroring, both the pointers should see the same
data, that can happen with page table entries pointing to same pages,
right?

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com>
From: Anshuman Khandual
Date: Fri, 7 Jul 2017 14:15:58 +0530
In-Reply-To: <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com>
To: Mike Kravetz, linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel.
>
> This patch simply adds a new flag to mremap which will make this
> functionality part of the API. It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.

Yeah it all looks good. But why is this requirement that if
MREMAP_MAYMOVE is specified then old_size and new_size must
be equal.

From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 7 Jul 2017 13:23:24 +0300
From: "Kirill A. Shutemov"
Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality
Message-ID: <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name>
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com>
In-Reply-To: <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com>
To: Mike Kravetz
Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel.
>
> This patch simply adds a new flag to mremap which will make this
> functionality part of the API. It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.

The patch breaks important invariant that anon page can be mapped into a
process only once.

What is going to happen to mirrored after CoW for instance?

In my opinion, it shouldn't be allowed for anon/private mappings at least.
And with this limitation, I don't see much sense in the new interface --
just create mirror by mmap()ing the file again.

--
 Kirill A. Shutemov

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
From: Anshuman Khandual
Date: Fri, 7 Jul 2017 16:33:36 +0530
In-Reply-To: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com>
Message-Id: <0f935c5a-2580-c95a-4ea5-c25e796dad03@linux.vnet.ibm.com>
To: Mike Kravetz, linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/06/2017 09:47 PM, Mike Kravetz wrote:
> The mremap system call has the ability to 'mirror' parts of an existing
> mapping. To do so, it creates a new mapping that maps the same pages as
> the original mapping, just at a different virtual address. This
> functionality has existed since at least the 2.6 kernel [1]. A comment
> was added to the code to help preserve this feature.

In mremap() implementation move_vma() attempts to do do_unmap() after
move_page_tables(). do_unmap() on the original VMA bails out because
the requested length being 0. Hence both the original VMA and the new
VMA remains after the page table migration. Seems like this whole
mirror function is by coincidence or it has been designed that way?

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <6f1460ef-a896-aef4-c0dc-66227232e025@linux.vnet.ibm.com>
From: Mike Kravetz
Date: Fri, 7 Jul 2017 10:04:04 -0700
In-Reply-To: <6f1460ef-a896-aef4-c0dc-66227232e025@linux.vnet.ibm.com>
To: Anshuman Khandual, linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/07/2017 01:19 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel [1]. A comment
>> was added to the code to help preserve this feature.
>
>
> Is this the comment? If yes, then it's not very clear.
>
> 	/*
> 	 * We allow a zero old-len as a special case
> 	 * for DOS-emu "duplicate shm area" thing. But
> 	 * a zero new-len is nonsensical.
> 	 */
>

Yes, I believe that is the comment.

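To make the zero old-len case concrete, here is a minimal userspace
sketch. It assumes a shared anonymous mapping and a kernel that honours
old_size == 0 as described in this thread; the helper name
sketch_mirror_roundtrip() is made up for illustration and is not part of
any API. It maps one page, asks mremap() for a second mapping of the same
page, and checks that a store through either address is visible through
the other.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for mremap() and MREMAP_MAYMOVE */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallback in case feature-test macros were fixed by an earlier include. */
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 1
void *mremap(void *old_address, size_t old_size, size_t new_size,
	     int flags, ...);
#endif

/*
 * Hypothetical helper: returns 0 if the mirror shares pages with the
 * original mapping, 1 if it does not, -1 if mmap() or mremap() failed.
 */
static int sketch_mirror_roundtrip(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *orig, *mirror;
	int ok;

	assert(page > 0);
	len = (size_t)page;

	orig = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (orig == MAP_FAILED)
		return -1;

	/* old_size == 0: keep the original and map its pages elsewhere. */
	mirror = mremap(orig, 0, len, MREMAP_MAYMOVE);
	if (mirror == MAP_FAILED) {
		munmap(orig, len);
		return -1;
	}

	orig[0] = 'a';			/* store through the original ... */
	ok = (mirror != orig) && (mirror[0] == 'a');
	mirror[1] = 'b';		/* ... and through the mirror */
	ok = ok && (orig[1] == 'b');

	munmap(mirror, len);
	munmap(orig, len);
	return ok ? 0 : 1;
}
```

A caller would simply check that sketch_mirror_roundtrip() returns 0.
Nothing here depends on the proposed MREMAP_MIRROR flag.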
>>
>> The Oracle JVM team has discovered this feature and used it while
>> prototyping a new garbage collection model. This new model shows promise,
>> and they are considering its use in a future release. However, since
>> the only mention of this functionality is a single comment in the kernel,
>> they are concerned about its future.
>>
>> I propose the addition of a new MREMAP_MIRROR flag to explicitly request
>> this functionality. The flag simply provides the same functionality as
>> the existing undocumented 'old_size == 0' interface. As an alternative,
>> we could simply document the 'old_size == 0' interface in the man page.
>> In either case, man page modifications would be needed.
>
> Right. Adding MREMAP_MIRROR sounds cleaner from application programming
> point of view. But it extends the interface.

Yes. That is the reason for the RFC. We currently have functionality
that is not clearly part of a programming interface. Application
programmers do not like to depend on something that is not part of
an interface.

>>
>> Future Direction
>>
>> After more formally adding this to the API (either new flag or documenting
>> existing interface), the mremap code could be enhanced to optimize this
>> case. Currently, 'mirroring' only sets up the new mapping. It does not
>> create page table entries for new mapping. This could be added as an
>> enhancement.
>
> Then how it achieves mirroring, both the pointers should see the same
> data, that can happen with page table entries pointing to same pages,
> right?

Correct.

In the code today, page tables for the new (mirrored) mapping are created
as needed via faults. The enhancement would be to create page table entries
for the new mapping.

--
Mike Kravetz

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <0f935c5a-2580-c95a-4ea5-c25e796dad03@linux.vnet.ibm.com>
From: Mike Kravetz
Message-ID: <7a5d293b-44d7-b0f4-20e5-6a3428c25ed2@oracle.com>
Date: Fri, 7 Jul 2017 10:12:32 -0700
In-Reply-To: <0f935c5a-2580-c95a-4ea5-c25e796dad03@linux.vnet.ibm.com>
To: Anshuman Khandual, linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/07/2017 04:03 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel [1]. A comment
>> was added to the code to help preserve this feature.
>
> In mremap() implementation move_vma() attempts to do do_unmap() after
> move_page_tables(). do_unmap() on the original VMA bails out because
> the requested length being 0. Hence both the original VMA and the new
> VMA remains after the page table migration. Seems like this whole
> mirror function is by coincidence or it has been designed that way?

I honestly do not know. From what I can tell, the functionality existed
in 2.4. The email thread [1] exists because it was 'accidentally' removed
in 2.6. All of this is before git history (and my involvement).

My 'guess' is that this functionality was created by coincidence. Someone
noticed it and took advantage of it. When it was removed, their code
broke. The code was 'fixed' and a comment was added to the code in an
attempt to prevent removing the functionality in the future. Again, this
is speculation as I was not originally involved.

The point of this RFC is to consider adding the functionality to the API.
If we are carrying the functionality in the code, we should at least
document it so that application programmers can take advantage of it.

[1] https://lkml.org/lkml/2004/1/12/260

--
Mike Kravetz

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com>
From: Mike Kravetz
Message-ID: <3a43d5fa-223d-1315-513b-85d3a09a07b6@oracle.com>
Date: Fri, 7 Jul 2017 10:14:53 -0700
To: Anshuman Khandual, linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/07/2017 01:45 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API. It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>
> Yeah it all looks good. But why is this requirement that if
> MREMAP_MAYMOVE is specified then old_size and new_size must
> be equal.

No real reason. I just wanted to clearly separate the new interface from
the old. On second thought, it would be better to require old_size == 0
as in the legacy interface.

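For reference, the argument checks in the RFC as posted can be modelled
in userspace roughly as follows. This is only a sketch under stated
assumptions: the SK_* constants mirror the values in the patch
(MREMAP_MIRROR is not in any released header), the function name
sketch_check_mremap_args() is made up, page alignment of the lengths is
left out, and the early "return ret" paths are assumed to return -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Flag values as proposed in the patch; illustrative names only. */
#define SK_MREMAP_MAYMOVE	0x01UL
#define SK_MREMAP_FIXED		0x02UL
#define SK_MREMAP_MIRROR	0x04UL

/*
 * Hypothetical model of the checks at the top of the mremap syscall
 * with the RFC applied. Returns 0 if the arguments pass, -EINVAL if
 * not. On success, *is_mirror reports whether the call would take the
 * mirroring path.
 */
static int sketch_check_mremap_args(unsigned long flags,
				    unsigned long old_len,
				    unsigned long new_len,
				    int *is_mirror)
{
	assert(is_mirror != NULL);
	*is_mirror = 0;

	/* Unknown flag bits are rejected. */
	if (flags & ~(SK_MREMAP_FIXED | SK_MREMAP_MAYMOVE | SK_MREMAP_MIRROR))
		return -EINVAL;

	/* Both FIXED and MIRROR require MAYMOVE. */
	if ((flags & (SK_MREMAP_FIXED | SK_MREMAP_MIRROR)) &&
	    !(flags & SK_MREMAP_MAYMOVE))
		return -EINVAL;

	/* A zero new-len is nonsensical. */
	if (!new_len)
		return -EINVAL;

	if (!old_len) {
		/* Legacy interface: zero old-len implies mirroring. */
		*is_mirror = 1;
	} else if (flags & SK_MREMAP_MIRROR) {
		/* New interface as posted: sizes must match. */
		if (old_len != new_len)
			return -EINVAL;
		*is_mirror = 1;
	}
	return 0;
}
```

With these checks, MAYMOVE|MIRROR with equal sizes and the legacy
old_len == 0 form both end up on the mirroring path, while MIRROR
without MAYMOVE or with mismatched sizes is rejected.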
--
Mike Kravetz

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality
References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name>
From: Mike Kravetz
Date: Fri, 7 Jul 2017 10:29:52 -0700
In-Reply-To: <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name>
To: "Kirill A. Shutemov"
Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, Andrea Arcangeli, Michal Hocko, Aaron Lu, "Kirill A . Shutemov"

On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping. To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address. This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API. It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.
>
> The patch breaks important invariant that anon page can be mapped into a
> process only once.

Actually, the patch does not add any new functionality. It only provides
a new interface to existing functionality.

Is it not possible to have an anon page mapped twice into the same process
via system V shared memory? shmget(anon), shmat(), shmat().
Of course, those are shared rather than private anon pages.

>
> What is going to happen to mirrored after CoW for instance?
>
> In my opinion, it shouldn't be allowed for anon/private mappings at least.
> And with this limitation, I don't see much sense in the new interface --
> just create mirror by mmap()ing the file again.

The code today works for anon shared mappings. See simple program below.

You are correct in that it makes little or no sense for private mappings.
When looking closer at existing code, mremap() creates a new private
mapping in this case. This is most likely a bug.

Again, my intention is not to create new functionality but rather document
existing functionality as part of a programming interface.

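The System V case mentioned above can be sketched in a few lines. This is
an illustration only: the helper name sketch_shm_double_attach() is made
up, and it assumes System V shared memory is available to the process. It
attaches one private segment twice and checks that a store through the
first attachment is visible through the second.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if both attachments see the same
 * page, 1 if they do not, -1 if a shm call failed.
 */
static int sketch_shm_double_attach(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *a, *b;
	int id, ok;

	assert(page > 0);

	id = shmget(IPC_PRIVATE, (size_t)page, IPC_CREAT | 0600);
	if (id < 0)
		return -1;

	a = shmat(id, NULL, 0);
	b = shmat(id, NULL, 0);

	/* Mark for removal; the segment lives until the last detach. */
	shmctl(id, IPC_RMID, NULL);

	if (a == (void *)-1 || b == (void *)-1) {
		if (a != (void *)-1)
			shmdt(a);
		if (b != (void *)-1)
			shmdt(b);
		return -1;
	}

	a[0] = 'a';
	ok = (a != b) && (b[0] == 'a');

	shmdt(a);
	shmdt(b);
	return ok ? 0 : 1;
}
```

As in the mremap() case, the two addresses differ but refer to the same
shared page, so a return value of 0 is the expected outcome.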
--
Mike Kravetz

#define _GNU_SOURCE	/* for mremap() and MREMAP_*; replaces __USE_GNU */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define H_PAGESIZE (2 * 1024 * 1024)

int hugetlb = 0;

#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0x0UL)
/* #define FLAGS (MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB) */
#define FLAGS (MAP_SHARED|MAP_ANONYMOUS)

int main(int argc, char ** argv)
{
	int ret = 0;
	int i, bad;
	long long hpages;
	void *addr;
	void *addr2;

	if (argc == 2) {
		if (!strcmp(argv[1], "hugetlb"))
			hugetlb = 1;
	}
	hpages = 5;

	/* reserve twice the needed range so the mirror has a fixed target */
	printf("Reserving an address ...\n");
	addr = mmap(ADDR, H_PAGESIZE * hpages * 2, PROT_NONE,
		    MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE|
		    (hugetlb ? MAP_HUGETLB : 0), -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit (1);
	}
	printf("\tgot address %p to %p\n", (void *)addr,
	       (void *)((char *)addr + H_PAGESIZE * hpages * 2));

	printf("mmapping %lld 2MB huge pages\n", hpages);
	addr = mmap(addr, H_PAGESIZE * hpages, PROT_READ|PROT_WRITE,
		    MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS|
		    (hugetlb ? MAP_HUGETLB : 0), -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit (1);
	}

	/* initialize data */
	for (i = 0; i < hpages; i++)
		*((char *)addr + (i * H_PAGESIZE)) = 'a';
	printf("pages allocated and initialized at %p\n", (void *)addr);

	/* old_size == 0 asks mremap() to mirror rather than move */
	addr2 = mremap(addr, 0, H_PAGESIZE * hpages,
		       MREMAP_MAYMOVE | MREMAP_FIXED,
		       (char *)addr + (H_PAGESIZE * hpages));
	if (addr2 == MAP_FAILED) {
		perror("mremap");
		exit (1);
	}
	printf("mapping relocated to %p\n", (void *)addr2);

	/* verify data through the original address */
	printf("Verifying data at address %p\n", (void *)addr);
	bad = 0;
	for (i = 0; i < hpages; i++) {
		if (*((char *)addr + (i * H_PAGESIZE)) != 'a') {
			printf("data at address %p not as expected\n",
			       (void *)((char *)addr + (i * H_PAGESIZE)));
			bad = 1;
		}
	}
	if (!bad)
		printf("\t success!\n");
	else
		ret = 1;

	/* verify data through the mirrored address */
	printf("Verifying data at address %p\n", (void *)addr2);
	bad = 0;
	for (i = 0; i < hpages; i++) {
		if (*((char *)addr2 + (i * H_PAGESIZE)) != 'a') {
			printf("data at address %p not as expected\n",
			       (void *)((char *)addr2 + (i * H_PAGESIZE)));
			bad = 1;
		}
	}
	if (!bad)
		printf("\t success!\n");
	else
		ret = 1;

	return ret;
}

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 202C06B0279 for ; Fri, 7 Jul 2017 13:45:38 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id z81so9634190wrc.2 for ; Fri, 07 Jul 2017 10:45:38 -0700 (PDT) Received: from mail-wr0-x243.google.com (mail-wr0-x243.google.com. 
[2a00:1450:400c:c0c::243]) by mx.google.com with ESMTPS id b8si2496520wrb.130.2017.07.07.10.45.36 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 07 Jul 2017 10:45:36 -0700 (PDT) Received: by mail-wr0-x243.google.com with SMTP id z45so9185398wrb.2 for ; Fri, 07 Jul 2017 10:45:36 -0700 (PDT) Date: Fri, 7 Jul 2017 20:45:34 +0300 From: "Kirill A. Shutemov" Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170707174534.wdfbciyfpovi52dy@node.shutemov.name> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Michal Hocko , Aaron Lu , "Kirill A . Shutemov" On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote: > On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote: > > On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote: > >> The mremap system call has the ability to 'mirror' parts of an existing > >> mapping. To do so, it creates a new mapping that maps the same pages as > >> the original mapping, just at a different virtual address. This > >> functionality has existed since at least the 2.6 kernel. > >> > >> This patch simply adds a new flag to mremap which will make this > >> functionality part of the API. It maintains backward compatibility with > >> the existing way of requesting mirroring (old_size == 0). > >> > >> If this new MREMAP_MIRROR flag is specified, then new_size must equal > >> old_size. In addition, the MREMAP_MAYMOVE flag must be specified. > > > > The patch breaks important invariant that anon page can be mapped into a > > process only once. 
> > Actually, the patch does not add any new functionality. It only provides > a new interface to existing functionality. > > Is it not possible to have an anon page mapped twice into the same process > via system V shared memory? shmget(anon), shmat(), shmat. > Of course, those are shared rather than private anon pages. By anon pages I mean, private anon or file pages. These are subject to CoW. > > What is going to happen to mirrored after CoW for instance? > > > > In my opinion, it shouldn't be allowed for anon/private mappings at least. > > And with this limitation, I don't see much sense in the new interface -- > > just create mirror by mmap()ing the file again. > > The code today works for anon shared mappings. See simple program below. > > You are correct in that it makes little or no sense for private mappings. > When looking closer at existing code, mremap() creates a new private > mapping in this case. This is most likely a bug. IIRC, existing code doesn't create mirrors of private pages as it requires old_len to be zero. There's no way to get private pages mapped twice this way. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f69.google.com (mail-it0-f69.google.com [209.85.214.69]) by kanga.kvack.org (Postfix) with ESMTP id 21FA26B02F4 for ; Fri, 7 Jul 2017 14:09:41 -0400 (EDT) Received: by mail-it0-f69.google.com with SMTP id 188so60223839itx.9 for ; Fri, 07 Jul 2017 11:09:41 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. 
[156.151.31.81]) by mx.google.com with ESMTPS id 189si82949itz.49.2017.07.07.11.09.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 07 Jul 2017 11:09:40 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name> <20170707174534.wdfbciyfpovi52dy@node.shutemov.name> From: Mike Kravetz Message-ID: <79eca23d-9f1a-9713-3f6b-8f7598d53190@oracle.com> Date: Fri, 7 Jul 2017 11:09:26 -0700 MIME-Version: 1.0 In-Reply-To: <20170707174534.wdfbciyfpovi52dy@node.shutemov.name> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Michal Hocko , Aaron Lu , "Kirill A . Shutemov" On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote: > On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote: >> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote: >>> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote: >>>> The mremap system call has the ability to 'mirror' parts of an existing >>>> mapping. To do so, it creates a new mapping that maps the same pages as >>>> the original mapping, just at a different virtual address. This >>>> functionality has existed since at least the 2.6 kernel. >>>> >>>> This patch simply adds a new flag to mremap which will make this >>>> functionality part of the API. It maintains backward compatibility with >>>> the existing way of requesting mirroring (old_size == 0). >>>> >>>> If this new MREMAP_MIRROR flag is specified, then new_size must equal >>>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified. 
>>>
>>> The patch breaks important invariant that anon page can be mapped into a
>>> process only once.
>>
>> Actually, the patch does not add any new functionality. It only provides
>> a new interface to existing functionality.
>>
>> Is it not possible to have an anon page mapped twice into the same process
>> via system V shared memory? shmget(anon), shmat(), shmat.
>> Of course, those are shared rather than private anon pages.
>
> By anon pages I mean, private anon or file pages. These are subject to CoW.
>
>>> What is going to happen to mirrored after CoW for instance?
>>>
>>> In my opinion, it shouldn't be allowed for anon/private mappings at least.
>>> And with this limitation, I don't see much sense in the new interface --
>>> just create mirror by mmap()ing the file again.
>>
>> The code today works for anon shared mappings. See simple program below.
>>
>> You are correct in that it makes little or no sense for private mappings.
>> When looking closer at existing code, mremap() creates a new private
>> mapping in this case. This is most likely a bug.
>
> IIRC, existing code doesn't create mirrors of private pages as it requires
> old_len to be zero. There's no way to get private pages mapped twice this
> way.

Correct.
As mentioned above, mremap does 'something' for private anon pages when
old_len == 0. However, this may be considered a bug. In this case, mremap
creates a new private anon mapping of length new_size. Since old_len == 0,
it does not unmap any of the old mapping. So, in this case mremap basically
creates a new private mapping (unrelated to the original) and does not
modify the old mapping.

--
Mike Kravetz
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 897E6440843 for ; Sun, 9 Jul 2017 03:23:43 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id z45so17200381wrb.13 for ; Sun, 09 Jul 2017 00:23:43 -0700 (PDT) Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com. [148.163.158.5]) by mx.google.com with ESMTPS id p64si1536306wmp.45.2017.07.09.00.23.41 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 09 Jul 2017 00:23:42 -0700 (PDT) Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v697IqK2088825 for ; Sun, 9 Jul 2017 03:23:40 -0400 Received: from e23smtp02.au.ibm.com (e23smtp02.au.ibm.com [202.81.31.144]) by mx0b-001b2d01.pphosted.com with ESMTP id 2bjux32bg9-1 (version=TLSv1.2 cipher=AES256-SHA bits=256 verify=NOT) for ; Sun, 09 Jul 2017 03:23:40 -0400 Received: from localhost by e23smtp02.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Sun, 9 Jul 2017 17:23:37 +1000 Received: from d23av03.au.ibm.com (d23av03.au.ibm.com [9.190.234.97]) by d23relay10.au.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id v697NZgM20185176 for ; Sun, 9 Jul 2017 17:23:35 +1000 Received: from d23av03.au.ibm.com (localhost [127.0.0.1]) by d23av03.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id v697NQcJ007697 for ; Sun, 9 Jul 2017 17:23:26 +1000 Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <3a43d5fa-223d-1315-513b-85d3a09a07b6@oracle.com> From: Anshuman Khandual Date: Sun, 9 Jul 2017 12:53:30 +0530 MIME-Version: 1.0 In-Reply-To: <3a43d5fa-223d-1315-513b-85d3a09a07b6@oracle.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Message-Id: <37f275bb-57c2-1485-02f2-dc71021f612a@linux.vnet.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz , Anshuman Khandual , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Andrea Arcangeli , Michal Hocko , Aaron Lu , "Kirill A . Shutemov" On 07/07/2017 10:44 PM, Mike Kravetz wrote: > On 07/07/2017 01:45 AM, Anshuman Khandual wrote: >> On 07/06/2017 09:47 PM, Mike Kravetz wrote: >>> The mremap system call has the ability to 'mirror' parts of an existing >>> mapping. To do so, it creates a new mapping that maps the same pages as >>> the original mapping, just at a different virtual address. This >>> functionality has existed since at least the 2.6 kernel. >>> >>> This patch simply adds a new flag to mremap which will make this >>> functionality part of the API. It maintains backward compatibility with >>> the existing way of requesting mirroring (old_size == 0). >>> >>> If this new MREMAP_MIRROR flag is specified, then new_size must equal >>> old_size. 
In addition, the MREMAP_MAYMOVE flag must be specified. >> >> Yeah it all looks good. But why is this requirement that if >> MREMAP_MAYMOVE is specified then old_size and new_size must >> be equal. > > No real reason. I just wanted to clearly separate the new interface from > the old. On second thought, it would be better to require old_size == 0 > as in the legacy interface. That would be redundant. Mirroring will just happen because old_size is 0 whether we mention the MREMAP_MIRROR flag or not. IMHO it should just mirror if the flag is specified irrespective of the old_size value. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id B857E440843 for ; Sun, 9 Jul 2017 03:32:15 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id j186so81095580pge.12 for ; Sun, 09 Jul 2017 00:32:15 -0700 (PDT) Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com. [148.163.156.1]) by mx.google.com with ESMTPS id d34si2784656pld.416.2017.07.09.00.32.14 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 09 Jul 2017 00:32:14 -0700 (PDT) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.21/8.16.0.21) with SMTP id v697TFMH131208 for ; Sun, 9 Jul 2017 03:32:14 -0400 Received: from e23smtp05.au.ibm.com (e23smtp05.au.ibm.com [202.81.31.147]) by mx0a-001b2d01.pphosted.com with ESMTP id 2bjtpt4ar7-1 (version=TLSv1.2 cipher=AES256-SHA bits=256 verify=NOT) for ; Sun, 09 Jul 2017 03:32:14 -0400 Received: from localhost by e23smtp05.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Sun, 9 Jul 2017 17:32:11 +1000 Received: from d23av01.au.ibm.com (d23av01.au.ibm.com [9.190.234.96]) by d23relay08.au.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id v697W9Vm14680250 for ; Sun, 9 Jul 2017 17:32:09 +1000 Received: from d23av01.au.ibm.com (localhost [127.0.0.1]) by d23av01.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id v697W8hS012667 for ; Sun, 9 Jul 2017 17:32:08 +1000 Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name> <20170707174534.wdfbciyfpovi52dy@node.shutemov.name> <79eca23d-9f1a-9713-3f6b-8f7598d53190@oracle.com> From: Anshuman Khandual Date: Sun, 9 Jul 2017 13:02:02 +0530 MIME-Version: 1.0 In-Reply-To: <79eca23d-9f1a-9713-3f6b-8f7598d53190@oracle.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Message-Id: <662d372a-5737-5f0b-8ac1-c997f3a935eb@linux.vnet.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz , "Kirill A. Shutemov" Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Michal Hocko , Aaron Lu , "Kirill A . Shutemov" On 07/07/2017 11:39 PM, Mike Kravetz wrote: > On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote: >> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote: >>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote: >>>> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote: >>>>> The mremap system call has the ability to 'mirror' parts of an existing >>>>> mapping. To do so, it creates a new mapping that maps the same pages as >>>>> the original mapping, just at a different virtual address. This >>>>> functionality has existed since at least the 2.6 kernel. 
>>>>> >>>>> This patch simply adds a new flag to mremap which will make this >>>>> functionality part of the API. It maintains backward compatibility with >>>>> the existing way of requesting mirroring (old_size == 0). >>>>> >>>>> If this new MREMAP_MIRROR flag is specified, then new_size must equal >>>>> old_size. In addition, the MREMAP_MAYMOVE flag must be specified. >>>> >>>> The patch breaks important invariant that anon page can be mapped into a >>>> process only once. >>> >>> Actually, the patch does not add any new functionality. It only provides >>> a new interface to existing functionality. >>> >>> Is it not possible to have an anon page mapped twice into the same process >>> via system V shared memory? shmget(anon), shmat(), shmat. >>> Of course, those are shared rather than private anon pages. >> >> By anon pages I mean, private anon or file pages. These are subject to CoW. >> >>>> What is going to happen to mirrored after CoW for instance? >>>> >>>> In my opinion, it shouldn't be allowed for anon/private mappings at least. >>>> And with this limitation, I don't see much sense in the new interface -- >>>> just create mirror by mmap()ing the file again. >>> >>> The code today works for anon shared mappings. See simple program below. >>> >>> You are correct in that it makes little or no sense for private mappings. >>> When looking closer at existing code, mremap() creates a new private >>> mapping in this case. This is most likely a bug. >> >> IIRC, existing code doesn't create mirrors of private pages as it requires >> old_len to be zero. There's no way to get private pages mapped twice this >> way. > > Correct. > As mentioned above, mremap does 'something' for private anon pages when > old_len == 0. However, this may be considered a bug. In this case, mremap > creates a new private anon mapping of length new_size. Since old_len == 0, > it does not unmap any of the old mapping. 
So, in this case mremap basically
> creates a new private mapping (unrelated to the original) and does not
> modify the old mapping.
>

Yeah, in my experiment, after the mremap() returns we have two different VMAs
which can contain two different sets of data. No page sharing is happening.

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id 556866B04B8 for ; Mon, 10 Jul 2017 12:22:54 -0400 (EDT) Received: by mail-wr0-f200.google.com with SMTP id 77so25580360wrb.11 for ; Mon, 10 Jul 2017 09:22:54 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id v81si7163767wmd.107.2017.07.10.09.22.52 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 10 Jul 2017 09:22:52 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name> <20170707174534.wdfbciyfpovi52dy@node.shutemov.name> <79eca23d-9f1a-9713-3f6b-8f7598d53190@oracle.com> <662d372a-5737-5f0b-8ac1-c997f3a935eb@linux.vnet.ibm.com> From: Vlastimil Babka Message-ID: <223c0ede-1203-4ea6-0157-a4500fea8050@suse.cz> Date: Mon, 10 Jul 2017 18:22:04 +0200 MIME-Version: 1.0 In-Reply-To: <662d372a-5737-5f0b-8ac1-c997f3a935eb@linux.vnet.ibm.com> Content-Type: text/plain; charset=windows-1252 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Anshuman Khandual , Mike Kravetz , "Kirill A. 
Shutemov" Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Michal Hocko , Aaron Lu , "Kirill A . Shutemov" On 07/09/2017 09:32 AM, Anshuman Khandual wrote: > On 07/07/2017 11:39 PM, Mike Kravetz wrote: >> On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote: >>> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote: >>>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote: >>>>> What is going to happen to mirrored after CoW for instance? >>>>> >>>>> In my opinion, it shouldn't be allowed for anon/private mappings at least. >>>>> And with this limitation, I don't see much sense in the new interface -- >>>>> just create mirror by mmap()ing the file again. >>>> >>>> The code today works for anon shared mappings. See simple program below. >>>> >>>> You are correct in that it makes little or no sense for private mappings. >>>> When looking closer at existing code, mremap() creates a new private >>>> mapping in this case. This is most likely a bug. >>> >>> IIRC, existing code doesn't create mirrors of private pages as it requires >>> old_len to be zero. There's no way to get private pages mapped twice this >>> way. >> >> Correct. >> As mentioned above, mremap does 'something' for private anon pages when >> old_len == 0. However, this may be considered a bug. In this case, mremap >> creates a new private anon mapping of length new_size. Since old_len == 0, >> it does not unmap any of the old mapping. So, in this case mremap basically >> creates a new private mapping (unrealted to the original) and does not >> modify the old mapping. >> > > Yeah, in my experiment, after the mremap() exists we have two different VMAs > which can contain two different set of data. No page sharing is happening. So how does this actually work for the JVM garbage collector use case? Aren't the garbage collected objects private anon? Anyway this should be documented. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f72.google.com (mail-vk0-f72.google.com [209.85.213.72]) by kanga.kvack.org (Postfix) with ESMTP id C268444084A for ; Mon, 10 Jul 2017 13:22:18 -0400 (EDT) Received: by mail-vk0-f72.google.com with SMTP id o19so41324135vkd.7 for ; Mon, 10 Jul 2017 10:22:18 -0700 (PDT) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id 143si1555965vkn.160.2017.07.10.10.22.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 10 Jul 2017 10:22:17 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170707102324.kfihkf72sjcrtn5b@node.shutemov.name> <20170707174534.wdfbciyfpovi52dy@node.shutemov.name> <79eca23d-9f1a-9713-3f6b-8f7598d53190@oracle.com> <662d372a-5737-5f0b-8ac1-c997f3a935eb@linux.vnet.ibm.com> <223c0ede-1203-4ea6-0157-a4500fea8050@suse.cz> From: Mike Kravetz Message-ID: Date: Mon, 10 Jul 2017 10:22:09 -0700 MIME-Version: 1.0 In-Reply-To: <223c0ede-1203-4ea6-0157-a4500fea8050@suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka , Anshuman Khandual , "Kirill A. Shutemov" Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Michal Hocko , Aaron Lu , "Kirill A . Shutemov" On 07/10/2017 09:22 AM, Vlastimil Babka wrote: > On 07/09/2017 09:32 AM, Anshuman Khandual wrote: >> On 07/07/2017 11:39 PM, Mike Kravetz wrote: >>> On 07/07/2017 10:45 AM, Kirill A. 
Shutemov wrote: >>>> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote: >>>>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote: >>>>>> What is going to happen to mirrored after CoW for instance? >>>>>> >>>>>> In my opinion, it shouldn't be allowed for anon/private mappings at least. >>>>>> And with this limitation, I don't see much sense in the new interface -- >>>>>> just create mirror by mmap()ing the file again. >>>>> >>>>> The code today works for anon shared mappings. See simple program below. >>>>> >>>>> You are correct in that it makes little or no sense for private mappings. >>>>> When looking closer at existing code, mremap() creates a new private >>>>> mapping in this case. This is most likely a bug. >>>> >>>> IIRC, existing code doesn't create mirrors of private pages as it requires >>>> old_len to be zero. There's no way to get private pages mapped twice this >>>> way. >>> >>> Correct. >>> As mentioned above, mremap does 'something' for private anon pages when >>> old_len == 0. However, this may be considered a bug. In this case, mremap >>> creates a new private anon mapping of length new_size. Since old_len == 0, >>> it does not unmap any of the old mapping. So, in this case mremap basically >>> creates a new private mapping (unrealted to the original) and does not >>> modify the old mapping. >>> >> >> Yeah, in my experiment, after the mremap() exists we have two different VMAs >> which can contain two different set of data. No page sharing is happening. > > So how does this actually work for the JVM garbage collector use case? > Aren't the garbage collected objects private anon? Good point. The sample program the JVM team gave me uses a shared anon mapping. As you mention one would expect these mappings to be private. I have asked them for more details on their use case. > Anyway this should be documented. Yes, their prototype work seems to take advantage of this existing undocumented behavior. 
It seems we have been carrying this functionality for at least 13 years. It may be time to document. -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 6C7186B04FB for ; Tue, 11 Jul 2017 08:36:46 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id l34so31426554wrc.12 for ; Tue, 11 Jul 2017 05:36:46 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id z68si9354364wmz.6.2017.07.11.05.36.45 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 11 Jul 2017 05:36:45 -0700 (PDT) Date: Tue, 11 Jul 2017 14:36:42 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170711123642.GC11936@dhcp22.suse.cz> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Aaron Lu , "Kirill A . Shutemov" On Thu 06-07-17 09:17:26, Mike Kravetz wrote: > The mremap system call has the ability to 'mirror' parts of an existing > mapping. To do so, it creates a new mapping that maps the same pages as > the original mapping, just at a different virtual address. This > functionality has existed since at least the 2.6 kernel. > > This patch simply adds a new flag to mremap which will make this > functionality part of the API. 
It maintains backward compatibility with
> the existing way of requesting mirroring (old_size == 0).
>
> If this new MREMAP_MIRROR flag is specified, then new_size must equal
> old_size. In addition, the MREMAP_MAYMOVE flag must be specified.

I have to admit that this came as a surprise to me. There is no mention
of this special case in the man page and the mremap code is so
convoluted that I simply didn't see it there. I guess the only
reasonable use case is when you do not have a fd for the shared memory.

Anyway the patch should fail with -EINVAL on private mappings as Kirill
already pointed out and this should go along with an update to the
man page which also describes the historical behavior. Make sure you
document that this is not really a mirroring (e.g. faulting page in one
address will automatically map it to the other mapping(s)) but merely a
copy of the range. Maybe MREMAP_COPY would be a more appropriate name.

> Signed-off-by: Mike Kravetz
> ---
>  include/uapi/linux/mman.h       |  5 +++--
>  mm/mremap.c                     | 23 ++++++++++++++++-------
>  tools/include/uapi/linux/mman.h |  5 +++--
>  3 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index ade4acd..6b3e0df 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -3,8 +3,9 @@
>
>  #include
>
> -#define MREMAP_MAYMOVE	1
> -#define MREMAP_FIXED	2
> +#define MREMAP_MAYMOVE	0x01
> +#define MREMAP_FIXED	0x02
> +#define MREMAP_MIRROR	0x04
>
>  #define OVERCOMMIT_GUESS		0
>  #define OVERCOMMIT_ALWAYS		1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index cd8a1b1..f18ab36 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
>  	LIST_HEAD(uf_unmap);
>
> -	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> +	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
>  		return ret;
>
> -	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
> +	if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
> +	    !(flags & MREMAP_MAYMOVE))
>  		return ret;
>
>  	if (offset_in_page(addr))
> @@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	old_len = PAGE_ALIGN(old_len);
>  	new_len = PAGE_ALIGN(new_len);
>
> -	/*
> -	 * We allow a zero old-len as a special case
> -	 * for DOS-emu "duplicate shm area" thing. But
> -	 * a zero new-len is nonsensical.
> -	 */
> +	/* A zero new-len is nonsensical. */
>  	if (!new_len)
>  		return ret;
>
> +	/*
> +	 * For backward compatibility, we allow a zero old-len to imply
> +	 * mirroring. This was originally a special case for DOS-emu.
> +	 */
> +	if (!old_len)
> +		flags |= MREMAP_MIRROR;
> +	else if (flags & MREMAP_MIRROR) {
> +		if (old_len != new_len)
> +			return ret;
> +		old_len = 0;
> +	}
> +
>  	if (down_write_killable(&current->mm->mmap_sem))
>  		return -EINTR;
>
> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
> index 81d8edf..069f7a5 100644
> --- a/tools/include/uapi/linux/mman.h
> +++ b/tools/include/uapi/linux/mman.h
> @@ -3,8 +3,9 @@
>
>  #include
>
> -#define MREMAP_MAYMOVE	1
> -#define MREMAP_FIXED	2
> +#define MREMAP_MAYMOVE	0x01
> +#define MREMAP_FIXED	0x02
> +#define MREMAP_MIRROR	0x04
>
>  #define OVERCOMMIT_GUESS		0
>  #define OVERCOMMIT_ALWAYS		1
> --
> 2.7.5

--
Michal Hocko
SUSE Labs
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ua0-f199.google.com (mail-ua0-f199.google.com [209.85.217.199]) by kanga.kvack.org (Postfix) with ESMTP id 73F426810B5 for ; Tue, 11 Jul 2017 14:23:30 -0400 (EDT) Received: by mail-ua0-f199.google.com with SMTP id w19so119067uac.0 for ; Tue, 11 Jul 2017 11:23:30 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id e35si3479uah.154.2017.07.11.11.23.29 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 11:23:29 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> From: Mike Kravetz Message-ID: <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> Date: Tue, 11 Jul 2017 11:23:19 -0700 MIME-Version: 1.0 In-Reply-To: <20170711123642.GC11936@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On 07/11/2017 05:36 AM, Michal Hocko wrote: > On Thu 06-07-17 09:17:26, Mike Kravetz wrote: >> The mremap system call has the ability to 'mirror' parts of an existing >> mapping. To do so, it creates a new mapping that maps the same pages as >> the original mapping, just at a different virtual address. This >> functionality has existed since at least the 2.6 kernel. >> >> This patch simply adds a new flag to mremap which will make this >> functionality part of the API. It maintains backward compatibility with >> the existing way of requesting mirroring (old_size == 0). 
>> >> If this new MREMAP_MIRROR flag is specified, then new_size must equal >> old_size. In addition, the MREMAP_MAYMOVE flag must be specified. > > I have to admit that this came as a suprise to me. There is no mention > about this special case in the man page and the mremap code is so > convoluted that I simply didn't see it there. I guess the only > reasonable usecase is when you do not have a fd for the shared memory. I was surprised as well when a JVM developer pointed this out. >>From the old e-mail thread, here is original use case: shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732 shmat(11337732, 0, 0) = 0x40299000 shmctl(11337732, IPC_RMID, 0) = 0 mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0 mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x100000) = 0x100000 The JVM team wants to do something similar. They are using mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead of shmget/shmat. As Vlastimil mentioned previously, one would not expect a shared mapping for parts of the JVM heap. I am working to get clarification from the JVM team. > Anyway the patch should fail with -EINVAL on private mappings as Kirill > already pointed out Yes. I think this should be a separate patch. As mentioned earlier, mremap today creates a new/additional private mapping if called in this way with old_size == 0. To me, this is a bug. > and this should go along with an update to the > man page which describes also the historical behavior. Yes, man page updates are a must. One reason for the RFC was to determine if people thought we should: 1) Just document the existing old_size == 0 functionality 2) Create a more explicit interface such as a new mremap flag for this functionality I am waiting to see what direction people prefer before making any man page updates. > Make sure you > document that this is not really a mirroring (e.g. 
faulting page in one > address will automatically map it to the other mapping(s)) but merely a > copy of the range. Maybe MREMAP_COPY would be more appropriate name. Good point. mirror is the first word that came to mind, but it does not exactly apply. -- Mike Kravetz > >> Signed-off-by: Mike Kravetz >> --- >> include/uapi/linux/mman.h | 5 +++-- >> mm/mremap.c | 23 ++++++++++++++++------- >> tools/include/uapi/linux/mman.h | 5 +++-- >> 3 files changed, 22 insertions(+), 11 deletions(-) >> >> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h >> index ade4acd..6b3e0df 100644 >> --- a/include/uapi/linux/mman.h >> +++ b/include/uapi/linux/mman.h >> @@ -3,8 +3,9 @@ >> >> #include >> >> -#define MREMAP_MAYMOVE 1 >> -#define MREMAP_FIXED 2 >> +#define MREMAP_MAYMOVE 0x01 >> +#define MREMAP_FIXED 0x02 >> +#define MREMAP_MIRROR 0x04 >> >> #define OVERCOMMIT_GUESS 0 >> #define OVERCOMMIT_ALWAYS 1 >> diff --git a/mm/mremap.c b/mm/mremap.c >> index cd8a1b1..f18ab36 100644 >> --- a/mm/mremap.c >> +++ b/mm/mremap.c >> @@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, >> struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX; >> LIST_HEAD(uf_unmap); >> >> - if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE)) >> + if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR)) >> return ret; >> >> - if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE)) >> + if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) && >> + !(flags & MREMAP_MAYMOVE)) >> return ret; >> >> if (offset_in_page(addr)) >> @@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, >> old_len = PAGE_ALIGN(old_len); >> new_len = PAGE_ALIGN(new_len); >> >> - /* >> - * We allow a zero old-len as a special case >> - * for DOS-emu "duplicate shm area" thing. But >> - * a zero new-len is nonsensical. >> - */ >> + /* A zero new-len is nonsensical. 
*/ >> if (!new_len) >> return ret; >> >> + /* >> + * For backward compatibility, we allow a zero old-len to imply >> + * mirroring. This was originally a special case for DOS-emu. >> + */ >> + if (!old_len) >> + flags |= MREMAP_MIRROR; >> + else if (flags & MREMAP_MIRROR) { >> + if (old_len != new_len) >> + return ret; >> + old_len = 0; >> + } >> + >> if (down_write_killable(&current->mm->mmap_sem)) >> return -EINTR; >> >> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h >> index 81d8edf..069f7a5 100644 >> --- a/tools/include/uapi/linux/mman.h >> +++ b/tools/include/uapi/linux/mman.h >> @@ -3,8 +3,9 @@ >> >> #include >> >> -#define MREMAP_MAYMOVE 1 >> -#define MREMAP_FIXED 2 >> +#define MREMAP_MAYMOVE 0x01 >> +#define MREMAP_FIXED 0x02 >> +#define MREMAP_MIRROR 0x04 >> >> #define OVERCOMMIT_GUESS 0 >> #define OVERCOMMIT_ALWAYS 1 >> -- >> 2.7.5 >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f198.google.com (mail-qt0-f198.google.com [209.85.216.198]) by kanga.kvack.org (Postfix) with ESMTP id 17B806B04CA for ; Tue, 11 Jul 2017 17:03:00 -0400 (EDT) Received: by mail-qt0-f198.google.com with SMTP id g53so1854722qtc.6 for ; Tue, 11 Jul 2017 14:03:00 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id k1si393084qkd.166.2017.07.11.14.02.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 14:02:59 -0700 (PDT) Date: Tue, 11 Jul 2017 23:02:56 +0200 From: Andrea Arcangeli Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170711210256.GF22628@redhat.com> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Michal Hocko , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On Tue, Jul 11, 2017 at 11:23:19AM -0700, Mike Kravetz wrote: > I was surprised as well when a JVM developer pointed this out. > > From the old e-mail thread, here is original use case: > shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732 > shmat(11337732, 0, 0) = 0x40299000 > shmctl(11337732, IPC_RMID, 0) = 0 > mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0 > mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x100000) = 0x100000 > > The JVM team wants to do something similar. They are using > mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead > of shmget/shmat. As Vlastimil mentioned previously, one would not > expect a shared mapping for parts of the JVM heap. I am working > to get clarification from the JVM team. Why don't they use memfd_create instead? That's made so that the fd is born anon unlinked so when the last reference is dropped all memory associated with it is automatically freed. 
No need for IPC_RMID and then they can use mmap instead of mremap(len=0) to get a double map of it. If they use mmap(MAP_ANONYMOUS|MAP_SHARED) it's not hugetlbfs, that would have been the only issue. Using hugetlbfs for JVM wouldn't be really flexible, better they try to leverage THP on SHM or the hugetlbfs reservation gets in the way of efficient use of the unused memory for memory allocations that don't have a definitive size (i.e. JVM forks or more JVM are run in parallel). > Yes. I think this should be a separate patch. As mentioned earlier, > mremap today creates a new/additional private mapping if called in this > way with old_size == 0. To me, this is a bug. Kernel by sheer luck should stay stable, but the result is weird and it's unlikely intentional. memfd_create doesn't have such an issue, the new mmap MAP_PRIVATE will get the file pages correctly after a new mmap (even if there were cows in the old MAP_PRIVATE mmap). > One reason for the RFC was to determine if people thought we should: > 1) Just document the existing old_size == 0 functionality > 2) Create a more explicit interface such as a new mremap flag for this > functionality > > I am waiting to see what direction people prefer before making any > man page updates. I guess old_size == 0 would better be dropped if possible, if memfd_create fits perfectly your needs as I supposed above. If it's not dropped then it's not very far from allowing mmap of /proc/self/mm again (removed around so far as 2.3.x?). Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f199.google.com (mail-io0-f199.google.com [209.85.223.199]) by kanga.kvack.org (Postfix) with ESMTP id DD4296810BE for ; Tue, 11 Jul 2017 17:57:46 -0400 (EDT) Received: by mail-io0-f199.google.com with SMTP id f1so4023713ioj.11 for ; Tue, 11 Jul 2017 14:57:46 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id r65si490337itc.28.2017.07.11.14.57.45 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 14:57:46 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170711210256.GF22628@redhat.com> From: Mike Kravetz Message-ID: Date: Tue, 11 Jul 2017 14:57:38 -0700 MIME-Version: 1.0 In-Reply-To: <20170711210256.GF22628@redhat.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli Cc: Michal Hocko , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On 07/11/2017 02:02 PM, Andrea Arcangeli wrote: > On Tue, Jul 11, 2017 at 11:23:19AM -0700, Mike Kravetz wrote: >> I was surprised as well when a JVM developer pointed this out. >> >> From the old e-mail thread, here is original use case: >> shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732 >> shmat(11337732, 0, 0) = 0x40299000 >> shmctl(11337732, IPC_RMID, 0) = 0 >> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0 >> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x100000) = 0x100000 >> >> The JVM team wants to do something similar. 
They are using >> mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead >> of shmget/shmat. As Vlastimil mentioned previously, one would not >> expect a shared mapping for parts of the JVM heap. I am working >> to get clarification from the JVM team. > > Why don't they use memfd_create instead? That's made so that the fd is > born anon unlinked so when the last reference is dropped all memory > associated with it is automatically freed. No need of IC_RMID and then > they can use mmap instead of mremap(len=0) to get a double map of it. Wow! I did not even know about memfd_create until you mentioned it. That would certainly work for 'normal' pages. > If they use mmap(MAP_ANONYMOUS|MAP_SHARED) it's not hugetlbfs, that > would have been the only issue. > > Using hugetlbfs for JVM wouldn't be really flexible, better they try > to leverage THP on SHM or the hugetlbfs reservation gets in the way of > efficient use of the unused memory for memory allocations that don't > have a definitive size (i.e. JVM forks or more JVM are run in > parallel). Well, the JVM has had a config option for the use of hugetlbfs for quite some time. I assume they have already had to deal with these issues. What prompted this discussion is that they want the mremap mirroring/ duplication functionality extended to support hugetlbfs. This is pretty straight forward. But, I wanted to have a discussion about whether the mremap(old_size == 0) functionality should be formally documented first. Do note that if you actually create/mount a hugetlbfs filesystem and use a fd in that filesystem you can get the desired functionality. However, they want to avoid this extra step if possible and use mmap(anon, hugetlb). I'm guessing that if memfd_create supported hugetlbfs, that would also meet their needs. Any thoughts about extending memfd_create support to hugetlbfs? I can't think of any big issues. In fact, 'under the covers' there actually is a hugetlbfs file created for anon mappings. 
However, that is not exposed to the user. >> Yes. I think this should be a separate patch. As mentioned earlier, >> mremap today creates a new/additional private mapping if called in this >> way with old_size == 0. To me, this is a bug. > > Kernel by sheer luck should stay stable, but the result is weird and > it's unlikely intentional. Yes, that is why I think it is a bug. Not that the kernel is unstable, but rather the unintentional/unexpected result. > memfd_create doesn't have such issue, the new mmap MAP_PRIVATE will > get the file pages correctly after a new mmap (even if there were cows > in the old MAP_PRIVATE mmap). > >> One reason for the RFC was to determine if people thought we should: >> 1) Just document the existing old_size == 0 functionality >> 2) Create a more explicit interface such as a new mremap flag for this >> functionality >> >> I am waiting to see what direction people prefer before making any >> man page updates. > > I guess old_size == 0 would better be dropped if possible, if > memfd_create fits perfectly your needs as I supposed above. If it's > not dropped then it's not very far from allowing mmap of /proc/self/mm > again (removed around so far as 2.3.x?). Yes, in my google'ing it appears the first users of mremap(old_size == 0) previously used mmap of /proc/self/mm. If memfd_create can be extended to support hugetlbfs, then I might suggest dropping the mremap(old_size == 0) support. Just a thought. -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f198.google.com (mail-qk0-f198.google.com [209.85.220.198]) by kanga.kvack.org (Postfix) with ESMTP id 41E246810BE for ; Tue, 11 Jul 2017 19:31:19 -0400 (EDT) Received: by mail-qk0-f198.google.com with SMTP id z22so3073552qka.4 for ; Tue, 11 Jul 2017 16:31:19 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id b185si676996qkf.160.2017.07.11.16.31.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 16:31:18 -0700 (PDT) Date: Wed, 12 Jul 2017 01:31:14 +0200 From: Andrea Arcangeli Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170711233114.GH22628@redhat.com> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170711210256.GF22628@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Michal Hocko , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On Tue, Jul 11, 2017 at 02:57:38PM -0700, Mike Kravetz wrote: > Well, the JVM has had a config option for the use of hugetlbfs for quite > some time. I assume they have already had to deal with these issues. Yes, the config tweak exists well before THP existed but in production I know nobody who used it because as you start more processes you risk running out of hugetlbfs reservation and in addition the reservation "wastes memory" at times. > What prompted this discussion is that they want the mremap mirroring/ > duplication functionality extended to support hugetlbfs. 
This is pretty > straight forward. But, I wanted to have a discussion about whether the > mremap(old_size == 0) functionality should be formally documented first. Agreed. > Do note that if you actually create/mount a hugetlbfs filesystem and > use a fd in that filesystem you can get the desired functionality. However, > they want to avoid this extra step if possible and use mmap(anon, hugetlb). I see, I thought they needed to use the mremap on pure SHM because there was no MAP_HUGETLB in the mmap flags of the use case. > I'm guessing that if memfd_create supported hugetlbfs, that would also > meet their needs. Any thoughts about extending memfd_create support to > hugetlbfs? I can't think of any big issues. In fact, 'under the covers' > there actually is a hugetlbfs file created for anon mappings. However, > that is not exposed to the user. Yes, that should fit fine as MFD_HUGETLB or similar. > Yes, that is why I think it is a bug. Not that kernel is unstable, but > rather the unintentional/unexpected result. The most unexpected is the old mapping isn't wiped, at least it doesn't seem to cause trouble to anon as move_page_tables is nullified (old_end == old_addr so the loop never runs). > > memfd_create doesn't have such issue, the new mmap MAP_PRIVATE will > > get the file pages correctly after a new mmap (even if there were cows > > in the old MAP_PRIVATE mmap). > > > >> One reason for the RFC was to determine if people thought we should: > >> 1) Just document the existing old_size == 0 functionality > >> 2) Create a more explicit interface such as a new mremap flag for this > >> functionality > >> > >> I am waiting to see what direction people prefer before making any > >> man page updates. > > > > I guess old_size == 0 would better be dropped if possible, if > > memfd_create fits perfectly your needs as I supposed above. If it's > > not dropped then it's not very far from allowing mmap of /proc/self/mm > > again (removed around so far as 2.3.x?). 
> > Yes, in my google'ing it appears the first users of mremap(old_size == 0) > previously used mmap of /proc/self/mm. > > If memfd_create can be extended to support hugetlbfs, then I might suggest > dropping the memfd_create(old_size == 0) support. Just a thought. memfd_create interface sounds more robust than this mremap trick, they would have to deal with one more fd that's all. old_len == 0 by nullifying move_page_tables will cause not harm to anon pages however the place where we would drop the vma is do_munmap here: if (vm_flags & VM_ACCOUNT) { vma->vm_flags &= ~VM_ACCOUNT; excess = vma->vm_end - vma->vm_start - old_len; [..] if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) { /* OOM: unable to split vma, just get accounts right */ vm_unacct_memory(excess >> PAGE_SHIFT); excess = 0; } It looks like a split_vma allocation failure can leave the old vma around in a equal way to old_len == 0 (but in such case all anon payload will have been moved to the new vma). That also seems safe as far as the kernel is concerned but it could cause userland failure if you depend on SIGSEGV to trigger later on the original vma you thought was implicitly munmapped (and in MAP_SHARED case it could even lead to unexpected file corruption instead of an expected SIGSEGV). If nobody ever depends on whatever is left on the old vma it's ok, but it could still leave file handle pinned unexpectedly if it's not anon. The other issue of the old_len = 0 trick is that will unexpectedly wipe the VM_ACCOUNT from the original vma as side effect of the above, but it'd only be noticeable if you care about strict accounting. So there is at least such one glitch in it. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id B97DC6B0507 for ; Wed, 12 Jul 2017 07:46:59 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id o105so3205829wrc.5 for ; Wed, 12 Jul 2017 04:46:59 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id h205si2008618wmf.32.2017.07.12.04.46.58 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 12 Jul 2017 04:46:58 -0700 (PDT) Date: Wed, 12 Jul 2017 13:46:55 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170712114655.GG28912@dhcp22.suse.cz> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On Tue 11-07-17 11:23:19, Mike Kravetz wrote: > On 07/11/2017 05:36 AM, Michal Hocko wrote: [...] > > Anyway the patch should fail with -EINVAL on private mappings as Kirill > > already pointed out > > Yes. I think this should be a separate patch. As mentioned earlier, > mremap today creates a new/additional private mapping if called in this > way with old_size == 0. To me, this is a bug. Not only that. It clears existing ptes in the old mapping so the content is lost. That is quite unexpected behavior. 
Now it is hard to assume whether somebody relies on the behavior (I can easily imagine somebody doing backup&clear in an atomic way) so failing with EINVAL might break userspace so I am no longer sure. Anyway this really needs to be documented. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vk0-f69.google.com (mail-vk0-f69.google.com [209.85.213.69]) by kanga.kvack.org (Postfix) with ESMTP id B2668440874 for ; Wed, 12 Jul 2017 12:55:58 -0400 (EDT) Received: by mail-vk0-f69.google.com with SMTP id p193so10058008vkd.11 for ; Wed, 12 Jul 2017 09:55:58 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id x9si1418009uab.245.2017.07.12.09.55.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 09:55:57 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170712114655.GG28912@dhcp22.suse.cz> From: Mike Kravetz Message-ID: <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> Date: Wed, 12 Jul 2017 09:55:48 -0700 MIME-Version: 1.0 In-Reply-To: <20170712114655.GG28912@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Aaron Lu , "Kirill A . 
Shutemov" , Vlastimil Babka On 07/12/2017 04:46 AM, Michal Hocko wrote: > On Tue 11-07-17 11:23:19, Mike Kravetz wrote: >> On 07/11/2017 05:36 AM, Michal Hocko wrote: > [...] >>> Anyway the patch should fail with -EINVAL on private mappings as Kirill >>> already pointed out >> >> Yes. I think this should be a separate patch. As mentioned earlier, >> mremap today creates a new/additional private mapping if called in this >> way with old_size == 0. To me, this is a bug. > > Not only that. It clears existing ptes in the old mapping so the content > is lost. That is quite unexpected behavior. Now it is hard to assume > whether somebody relies on the behavior (I can easily imagine somebody > doing backup&clear in atomic way) so failing with EINVAL might break > userspace so I am not longer sure. Anyway this really needs to be > documented. I am pretty sure it does not clear ptes in the old mapping, or modify it in any way. Are you thinking they are cleared as part of the call to move_page_tables? Since old_size == 0 (len as passed to move_page_tables), the for loop in move_page_tables is not run and it doesn't do much of anything in this case. My plan is to look into adding hugetlbfs support to memfd_create, as this would meet the user's needs. And, this is a much more sane API than this mremap(old_size == 0) behavior. If adding hugetlbfs support to memfd_create works out, I would like to see mremap(old_size == 0) support dropped. Nobody here (kernel mm development) seems to like it. However, as you note there may be somebody depending on this behavior. What would be the process for removing such support? AFAIK, it is not documented anywhere. If we do document the behavior, then we will certainly be stuck with it for a long time. -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 697A9440874 for ; Thu, 13 Jul 2017 02:16:56 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id v88so8102704wrb.1 for ; Wed, 12 Jul 2017 23:16:56 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id b204si4300929wmc.23.2017.07.12.23.16.54 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 12 Jul 2017 23:16:55 -0700 (PDT) Date: Thu, 13 Jul 2017 08:16:52 +0200 From: Michal Hocko Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170713061651.GA14492@dhcp22.suse.cz> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170712114655.GG28912@dhcp22.suse.cz> <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On Wed 12-07-17 09:55:48, Mike Kravetz wrote: > On 07/12/2017 04:46 AM, Michal Hocko wrote: > > On Tue 11-07-17 11:23:19, Mike Kravetz wrote: > >> On 07/11/2017 05:36 AM, Michal Hocko wrote: > > [...] > >>> Anyway the patch should fail with -EINVAL on private mappings as Kirill > >>> already pointed out > >> > >> Yes. I think this should be a separate patch. As mentioned earlier, > >> mremap today creates a new/additional private mapping if called in this > >> way with old_size == 0. 
To me, this is a bug. > > > > Not only that. It clears existing ptes in the old mapping so the content > > is lost. That is quite unexpected behavior. Now it is hard to assume > > whether somebody relies on the behavior (I can easily imagine somebody > > doing backup&clear in atomic way) so failing with EINVAL might break > > userspace so I am not longer sure. Anyway this really needs to be > > documented. > > I am pretty sure it does not clear ptes in the old mapping, or modify it > in any way. Are you thinking they are cleared as part of the call to > move_page_tables? Since old_size == 0 (len as passed to move_page_tables), > the for loop in move_page_tables is not run and it doesn't do much of > anything in this case. Dang. I have completely missed that we give old_len as the len parameter. Then it is clear that this old_len == 0 trick never really worked for MAP_PRIVATE because it simply fails the main invariant that the content at the new location matches the old one. Care to send a patch to clarify that and return EINVAL or should I do it? > My plan is to look into adding hugetlbfs support to memfd_create, as this > would meet the user's needs. And, this is a much more sane API than this > mremap(old_size == 0) behavior. agreed > If adding hugetlbfs support to memfd_create works out, I would like to > see mremap(old_size == 0) support dropped. Nobody here (kernel mm > development) seems to like it. However, as you note there may be somebody > depending on this behavior. What would be the process for removing > such support? AFAIK, it is not documented anywhere. If we do document > the behavior, then we will certainly be stuck with it for a long time. I would rather document it than remove it. From the past we know that there are users and my experience tells me that once something is used it lives its life for ever basically. And moreover it is not like this costs us any maintenance burden to support the hack. 
Just make it more obvious so that we do not have to rediscover it each time. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ua0-f198.google.com (mail-ua0-f198.google.com [209.85.217.198]) by kanga.kvack.org (Postfix) with ESMTP id 0404A440874 for ; Thu, 13 Jul 2017 12:02:04 -0400 (EDT) Received: by mail-ua0-f198.google.com with SMTP id j1so21853854uah.3 for ; Thu, 13 Jul 2017 09:02:03 -0700 (PDT) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id 21si22732vkg.6.2017.07.13.09.02.02 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 09:02:03 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170712114655.GG28912@dhcp22.suse.cz> <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> <20170713061651.GA14492@dhcp22.suse.cz> From: Mike Kravetz Message-ID: <21b264e7-b879-f072-03d2-f6f4aec5c957@oracle.com> Date: Thu, 13 Jul 2017 09:01:54 -0700 MIME-Version: 1.0 In-Reply-To: <20170713061651.GA14492@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Andrea Arcangeli , Aaron Lu , "Kirill A . 
Shutemov" , Vlastimil Babka On 07/12/2017 11:16 PM, Michal Hocko wrote: > On Wed 12-07-17 09:55:48, Mike Kravetz wrote: >> On 07/12/2017 04:46 AM, Michal Hocko wrote: >>> On Tue 11-07-17 11:23:19, Mike Kravetz wrote: >>>> On 07/11/2017 05:36 AM, Michal Hocko wrote: >>> [...] >>>>> Anyway the patch should fail with -EINVAL on private mappings as Kirill >>>>> already pointed out >>>> >>>> Yes. I think this should be a separate patch. As mentioned earlier, >>>> mremap today creates a new/additional private mapping if called in this >>>> way with old_size == 0. To me, this is a bug. >>> >>> Not only that. It clears existing ptes in the old mapping so the content >>> is lost. That is quite unexpected behavior. Now it is hard to assume >>> whether somebody relies on the behavior (I can easily imagine somebody >>> doing backup&clear in atomic way) so failing with EINVAL might break >>> userspace so I am not longer sure. Anyway this really needs to be >>> documented. >> >> I am pretty sure it does not clear ptes in the old mapping, or modify it >> in any way. Are you thinking they are cleared as part of the call to >> move_page_tables? Since old_size == 0 (len as passed to move_page_tables), >> the for loop in move_page_tables is not run and it doesn't do much of >> anything in this case. > > Dang. I have completely missed that we give old_len as the len > parameter. Then it is clear that this old_len == 0 trick never really > worked for MAP_PRIVATE because it simply fails the main invariant that > the content at the new location matches the old one. Care to send a > patch to clarify that and sent EINVAL or should I do it? Sent a patch (in separate e-mail thread) to return EINVAL for private mappings. >> If adding hugetlbfs support to memfd_create works out, I would like to >> see mremap(old_size == 0) support dropped. Nobody here (kernel mm >> development) seems to like it. However, as you note there may be somebody >> depending on this behavior. 
What would be the process for removing >> such support? AFAIK, it is not documented anywhere. If we do document >> the behavior, then we will certainly be stuck with it for a long time. > > I would rather document it than remove it. From the past we know that > there are users and my experience tells me that once something is used > it lives its life for ever basically. And moreover it is not like this > costs us any maintenance burden to support the hack. Just make it more > obvious so that we do not have to rediscover it each time. I will put together a patch to add a description of (old_size == 0) behavior to the man page. -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id 483E7440874 for ; Thu, 13 Jul 2017 12:30:58 -0400 (EDT) Received: by mail-qk0-f197.google.com with SMTP id q1so4977471qkb.3 for ; Thu, 13 Jul 2017 09:30:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id l128si5574379qkc.43.2017.07.13.09.30.56 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 09:30:57 -0700 (PDT) Date: Thu, 13 Jul 2017 18:30:54 +0200 From: Andrea Arcangeli Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170713163054.GK22628@redhat.com> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170712114655.GG28912@dhcp22.suse.cz> <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> <20170713061651.GA14492@dhcp22.suse.cz> <21b264e7-b879-f072-03d2-f6f4aec5c957@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <21b264e7-b879-f072-03d2-f6f4aec5c957@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Michal Hocko , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On Thu, Jul 13, 2017 at 09:01:54AM -0700, Mike Kravetz wrote: > Sent a patch (in separate e-mail thread) to return EINVAL for private > mappings. The way old_len == 0 behaves for MAP_PRIVATE seems more sane to me than the alternative of copying pagetables for anon pages (as behaving the way that way avoids to break anon pages invariants), despite it's not creating an exact mirror of what was in the original vma as it excludes any modification done to cowed anon pages. By nullifying move_page_tables old_len == 0 is simply duping the vma which is equivalent to a new mmap on the file for the MAP_PRIVATE case, it has a deterministic result. The real question is if it anybody is using it. So an alternative would be to start by adding a WARN_ON_ONCE deprecation warning instead of -EINVAL right away. 
The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side effect of using the old_len == 0 trick looks like a bug, I guess it should get fixed if we intend to keep old_len and document it for the long term. Overall I'm more concerned about the fact an allocation failure in do_munmap is unreported to userland and it will leave the old vma intact like old_len == 0 would do (unless I'm misreading something there). The VM_ACCOUNT wipe as side effect of old_len == 0 is not major short term concern. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f70.google.com (mail-it0-f70.google.com [209.85.214.70]) by kanga.kvack.org (Postfix) with ESMTP id 6781F440874 for ; Thu, 13 Jul 2017 14:12:25 -0400 (EDT) Received: by mail-it0-f70.google.com with SMTP id i71so76022473itf.2 for ; Thu, 13 Jul 2017 11:12:25 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. 
[156.151.31.81]) by mx.google.com with ESMTPS id c187si19273ith.132.2017.07.13.11.12.24 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 11:12:24 -0700 (PDT) Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170712114655.GG28912@dhcp22.suse.cz> <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> <20170713061651.GA14492@dhcp22.suse.cz> <21b264e7-b879-f072-03d2-f6f4aec5c957@oracle.com> <20170713163054.GK22628@redhat.com> From: Mike Kravetz Message-ID: <28a8da13-bdc2-3f23-dee9-607377ac1cc3@oracle.com> Date: Thu, 13 Jul 2017 11:11:37 -0700 MIME-Version: 1.0 In-Reply-To: <20170713163054.GK22628@redhat.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli Cc: Michal Hocko , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On 07/13/2017 09:30 AM, Andrea Arcangeli wrote: > On Thu, Jul 13, 2017 at 09:01:54AM -0700, Mike Kravetz wrote: >> Sent a patch (in separate e-mail thread) to return EINVAL for private >> mappings. > > The way old_len == 0 behaves for MAP_PRIVATE seems more sane to me > than the alternative of copying pagetables for anon pages (as behaving > the way that way avoids to break anon pages invariants), despite it's > not creating an exact mirror of what was in the original vma as it > excludes any modification done to cowed anon pages. > > By nullifying move_page_tables old_len == 0 is simply duping the vma > which is equivalent to a new mmap on the file for the MAP_PRIVATE > case, it has a deterministic result. The real question is if it > anybody is using it. 
As previously discussed, copying pagetables (via move_page_tables) does not happen if old_len == 0. This is true for both private and shared mappings.

Here is my understanding of how things work for old_len == 0 of anon mappings:
- shared mappings
    - New vma is created at new virtual address
    - vma refers to the same underlying object/pages as old vma
    - after mremap, no page tables exist for new vma, they are created as pages are accessed/faulted
    - page at new_address is same as page at old_address
- private mappings
    - New vma is created at new virtual address
    - vma does not refer to same pages as old vma. It is a 'new' private anon mapping.
    - after mremap, no page tables exist for new vma. access to the range of the new vma will result in faults that allocate a new page.
    - page at new_address is different than page at old_address

So, the result of mremap(old_len == 0) on a private mapping is that it simply creates a new private mapping. IMO, this is contrary to the purpose of mremap. mremap should return a mapping that is somehow related to the original mapping.

Perhaps you are thinking about mremap of a private file mapping? I was not considering that case. I believe you are right. In this case a private COW mapping based on the original mapping would be created. So, this seems more in line with the intent of mremap. The new mapping is still related to the old mapping.

With this in mind, what about returning EINVAL only for the anon private mapping case?

However, if you have an fd (for a file mapping) then I can not see why someone would be using the old_len == 0 trick. It would be more straightforward to simply use mmap to create the additional mapping.

> So an alternative would be to start by adding a WARN_ON_ONCE deprecation
> warning instead of -EINVAL right away.
> > The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side > effect of using the old_len == 0 trick looks like a bug, I guess it > should get fixed if we intend to keep old_len and document it for the > long term. Others seem to think we should keep old_len == 0 and document. > Overall I'm more concerned about the fact an allocation failure in > do_munmap is unreported to userland and it will leave the old vma > intact like old_len == 0 would do (unless I'm misreading something > there). The VM_ACCOUNT wipe as side effect of old_len == 0 is not > major short term concern. I assume you are concerned about the do_munmap call in move_vma? That does indeed look to be of concern. This happens AFTER setting up the new mapping. So, I'm thinking we should tear down the new mapping in the case do_munmap of the old mapping fails? That 'should' simply be a matter of: - moving page tables back to original mapping - remove/delete new vma - I don't think we need to 'unmap' the new vma as there should be no associated pages. I'll look into doing this as well. Just curious, do those userfaultfd callouts still work as desired in the case of map duplication (old_len == 0)? -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id 5C121440874 for ; Thu, 13 Jul 2017 16:33:32 -0400 (EDT) Received: by mail-qk0-f197.google.com with SMTP id s20so31545034qki.12 for ; Thu, 13 Jul 2017 13:33:32 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id a11si5724719qtd.334.2017.07.13.13.33.30 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 13:33:30 -0700 (PDT) Date: Thu, 13 Jul 2017 22:33:27 +0200 From: Andrea Arcangeli Subject: Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality Message-ID: <20170713203327.GL22628@redhat.com> References: <1499357846-7481-1-git-send-email-mike.kravetz@oracle.com> <1499357846-7481-2-git-send-email-mike.kravetz@oracle.com> <20170711123642.GC11936@dhcp22.suse.cz> <7f14334f-81d1-7698-d694-37278f05a78e@oracle.com> <20170712114655.GG28912@dhcp22.suse.cz> <3a2cfeae-520c-b6e5-2808-cf1bcf62b067@oracle.com> <20170713061651.GA14492@dhcp22.suse.cz> <21b264e7-b879-f072-03d2-f6f4aec5c957@oracle.com> <20170713163054.GK22628@redhat.com> <28a8da13-bdc2-3f23-dee9-607377ac1cc3@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <28a8da13-bdc2-3f23-dee9-607377ac1cc3@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Michal Hocko , linux-mm@kvack.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton , Aaron Lu , "Kirill A . Shutemov" , Vlastimil Babka On Thu, Jul 13, 2017 at 11:11:37AM -0700, Mike Kravetz wrote: > Here is my understanding of how things work for old_len == 0 of anon > mappings: > - shared mappings > - New vma is created at new virtual address > - vma refers to the same underlying object/pages as old vma > - after mremap, no page tables exist for new vma, they are > created as pages are accessed/faulted > - page at new_address is same as page at old_address Yes, and this isn't backed by anon memory, it's backed by shmem. "Shared anon mapping" is really synonymous of shmem, the fact it's not a mmap of a tmpfs file is purely an API detail. > - private mappings > - New vma is created at new virtual address > - vma does not refer to same pages as old vma. 
>       It is a 'new' private anon mapping.
>     - after mremap, no page tables exist for new vma. access to
>       the range of the new vma will result in faults that allocate
>       a new page.
>     - page at new_address is different than page at old_address

Yes, for an anon private mapping (so backed by real anonymous memory) no payload in the old vma could possibly go in the new vma.

> So, the result of mremap(old_len == 0) on a private mapping is that it
> simply creates a new private mapping. IMO, this is contrary to the purpose
> of mremap. mremap should return a mapping that is somehow related to
> the original mapping.

I agree there's no point in ever using the undocumented mremap(old_len == 0) trick to create a new anon private mmap, when you could use mmap instead and the result would be the same. So it's plausible that nobody uses it for that.

> Perhaps you are thinking about mremap of a private file mapping? I was
> not considering that case. I believe you are right. In this case a
> private COW mapping based on the original mapping would be created. So,
> this seems more in line with the intent of mremap. The new mapping is
> still related to the old mapping.

Yes, my earlier example was all about filebacked private mappings, to point out that those also have a deterministic behavior with the old_len == 0 trick and that it could still be used because the IPC_RMID was executed early on.

The point is that you could always use a plain new mmap instead of the old_len == 0 trick, but that applies to shared mappings as well. My argument is that if you keep it and document it for shared anon mappings, I don't see anything fundamentally wrong with keeping it for private filebacked mappings too, as the shmat ID may have been deleted for those too.

> With this in mind, what about returning EINVAL only for the anon private
> mapping case?
The only case where there's no excuse to use mremap(old_len == 0) as a replacement for a new mmap is the private anon mappings case, so while it may still break something (as opposed to a deprecation warning), I guess the likelihood that somebody is using it is very low.

> However, if you have a fd (for a file mapping) then I can not see why
> someone would be using the old_len == 0 trick. It would be more straight
> forward to simply use mmap to create the additional mapping.

That applies to MAP_SHARED too, and that's why deprecating the whole undocumented old_len == 0 sounded and still sounds attractive to me, but doing it right away without a deprecation warning cycle sounds too risky.

> > So an alternative would be to start by adding a WARN_ON_ONCE deprecation
> > warning instead of -EINVAL right away.
> >
> > The vma->vm_flags VM_ACCOUNT being wiped on the original vma as side
> > effect of using the old_len == 0 trick looks like a bug, I guess it
> > should get fixed if we intend to keep old_len and document it for the
> > long term.
>
> Others seem to think we should keep old_len == 0 and document.

The only case where it makes sense is after IPC_RMID, but with memfd_create there's no longer any point in using IPC_RMID. tmpfs/hugetlbfs/realfs files can be unlinked while the fd is still open, so again there is no need for the mremap(old_len == 0) trick.

Which is why I'd find it attractive to deprecate it if we could, but I assume we can't drop it even if undocumented, which is why I felt a deprecation warning would be suitable in this case (similar to the deprecation warning of sysfs, which was then dropped via a config option).

I am assuming here that nobody is using it because it's undocumented and it has a bug in the VM_ACCOUNT code too. Without a deprecation warning it'd be hard to tell if the assumption is correct.

> I assume you are concerned about the do_munmap call in move_vma? That

Yes exactly.

> does indeed look to be of concern. This happens AFTER setting up the
> new mapping.
> So, I'm thinking we should tear down the new mapping in
> the case do_munmap of the old mapping fails? That 'should' simply
> be a matter of:
> - moving page tables back to original mapping
> - remove/delete new vma

Yes.

> - I don't think we need to 'unmap' the new vma as there should be no
>   associated pages.

The new vma doesn't require memory allocations to drop, as it was just created by copy_vma, so there's no risk of further failures in the unwind. After the unwind it'll return -ENOMEM to userland (which we don't do right now).

> I'll look into doing this as well.

It's mostly theoretical: the chances of an allocation failure triggering exactly in that split_vma are basically zero, but I think it'd be more correct and safer.

> Just curious, do those userfaultfd callouts still work as desired in the
> case of map duplication (old_len == 0)?

old_len == 0 is fine with userfaultfd because len == 0 returns -EINVAL in do_munmap before userfaultfd_unmap_prep is called.

Still looking at the VM_ACCOUNT adjustments around do_munmap:

mremap:

    /* Conceal VM_ACCOUNT so old reservation is not undone */
    if (vm_flags & VM_ACCOUNT) {

do_munmap:

    if (uf) {
        int error = userfaultfd_unmap_prep(vma, start, end, uf);

        if (error)
            return error;
    }

    /*
     * If we need to split any vma, do it now to save pain later.
     *
     * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
     * unmapped vm_area_struct will remain in use: so lower split_vma
     * places tmp vma above, and higher split_vma places tmp vma below.
     */

I don't see where it matters that, on do_munmap failure, mremap assumes the partially unmapped vma remains in use. In fact it's not partially unmapped at all, it's only split at the "start" address of the do_munmap but not unmapped. The mremap caller simply sets excess = 0 and assumes it's all still mapped at the original vma as expected, regardless of the order of the __split_vma calls executed in do_munmap.
The whole VM_ACCOUNT logic in this place has existed since the start of the git history, so I can't see the change that introduced the above comment, but I assume the comment is wrong or simply confusing.

I don't see a problem with userfaultfd_unmap_prep failing with -ENOMEM in relation to the VM_ACCOUNT logic above, before split_vma is called (the callee doesn't seem to make any assumption).

However, unrelated to mremap old_len == 0, but purely internal to do_munmap and theoretical: if either of the two __split_vma calls fails there's no need to send an unmap event, and in fact it'd be wrong to, so userfaultfd_unmap_prep should be moved after both split_vma calls have succeeded.

Thanks,
Andrea
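Annotation (a sketch for readers, not code posted by anyone in this thread): the behavior debated above can be probed from userspace with a short C program. The helper names mremap_zero_result, memfd_alias_result and pages_alias are hypothetical, chosen only for this illustration. The sketch assumes Linux with a glibc that exposes memfd_create(); what mremap(old_size == 0) returns depends on the running kernel, so the program reports the outcome rather than assuming it. Per the discussion, a shared anonymous mapping is expected to alias the old pages, while a private anonymous mapping either yields an unrelated new mapping or, on kernels carrying the EINVAL change mentioned above, fails.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if a store through either pointer is visible through the other. */
static int pages_alias(volatile char *a, volatile char *b)
{
    a[0] = 'x';
    if (b[0] != 'x')
        return 0;
    b[0] = 'y';
    return a[0] == 'y';
}

/*
 * Hypothetical helper: call mremap() with old_size == 0 on a one-page
 * anonymous mapping created with share_flag (MAP_SHARED or MAP_PRIVATE).
 * Returns 1 if the new range aliases the old pages, 0 if the two ranges
 * are independent, -1 if the kernel refused the call, -2 on setup failure.
 */
int mremap_zero_result(int share_flag)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *old_map, *new_map;
    int ret;

    assert(share_flag == MAP_SHARED || share_flag == MAP_PRIVATE);

    old_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   share_flag | MAP_ANONYMOUS, -1, 0);
    if (old_map == MAP_FAILED)
        return -2;
    old_map[0] = 'o';   /* fault in a page so the old mapping has content */

    new_map = mremap(old_map, 0, len, MREMAP_MAYMOVE);
    if (new_map == MAP_FAILED) {
        fprintf(stderr, "mremap(old_size == 0): %s\n", strerror(errno));
        munmap(old_map, len);
        return -1;
    }

    ret = pages_alias(old_map, new_map);
    munmap(new_map, len);
    munmap(old_map, len);
    return ret;
}

/*
 * The alternative raised in the thread: keep an fd (here from
 * memfd_create) and simply mmap it twice with MAP_SHARED.
 * Returns 1 if the two mappings alias, 0 if not, -2 on setup failure.
 */
int memfd_alias_result(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *a, *b;
    int fd, ret = -2;

    fd = memfd_create("mirror-demo", 0);
    if (fd < 0)
        return -2;
    if (ftruncate(fd, (off_t)len) == 0) {
        a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (a != MAP_FAILED && b != MAP_FAILED)
            ret = pages_alias(a, b);
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
    }
    close(fd);
    return ret;
}
```

Calling mremap_zero_result(MAP_SHARED), mremap_zero_result(MAP_PRIVATE) and memfd_alias_result() from a small main() and printing the three return values shows which of the cases described in the thread the running kernel implements. The memfd variant needs no undocumented interface, which is the argument made above for preferring it.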