Date: Wed, 30 Oct 2019 09:40:06 +0100
From: Mike Rapoport
To: Andy Lutomirski
Cc: LKML, Alexey Dobriyan, Andrew Morton, Arnd Bergmann, Borislav Petkov,
    Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
    Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Linux API, Linux-MM,
    X86 ML, Mike Rapoport
Subject: Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
Message-ID: <20191030084005.GC20624@rapoport-lnx>
References: <1572171452-7958-1-git-send-email-rppt@kernel.org>
    <20191029093254.GE18773@rapoport-lnx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport wrote:
> >
> > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > >
> > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport wrote:
> > > >
> > > > From: Mike Rapoport
> > > >
> > > > Hi,
> > > >
> > > > The patch below aims to allow applications to create mappings that have
> > > > pages visible only to the owning process. Such mappings could be used to
> > > > store secrets so that these secrets are visible neither to other
> > > > processes nor to the kernel.
> > > >
> > > > I've only tested the basic functionality, the changes should be verified
> > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > >
> > > I’ve contemplated the concept a fair amount, and I think you should
> > > consider a change to the API. In particular, rather than having it be a
> > > MAP_ flag, make it a chardev.  You can, at least at first, allow only
> > > MAP_SHARED, and admins can decide who gets to use it.
> > > It might also play better with the VM overall, and you won’t need a
> > > VM_ flag for it -- you can just wire up .fault to do the right thing.
> >
> > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > interface.
>
> Then you have a whole bunch of questions to answer.  For example:
>
> What happens if you mprotect() or similar when the mapping is already
> in use in a way that's incompatible with MAP_EXCLUSIVE?

Then we refuse to mprotect()? Like in any other case when the vm_flags are
not compatible with the requested madvise()/mprotect() operation.

> Is it actually reasonable to malloc() some memory and then make it exclusive?
>
> Are you permitted to map a file MAP_EXCLUSIVE?  What does it mean?

I'd limit MAP_EXCLUSIVE to anonymous memory only.

> What does MAP_PRIVATE | MAP_EXCLUSIVE do?

My preference is to have only mmap(), and then the semantics are clearer:

MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
and drops the pages in this region from the direct map.
The pages are returned to the direct map on munmap().
Then there is no way to make an existing area exclusive or vice versa.

> How does one pass exclusive memory via SCM_RIGHTS?  (If it's a
> memfd-like or chardev interface, it's trivial.  mmap(), not so much.)

Why would passing such memory via SCM_RIGHTS be useful?

> And finally, there's my personal giant pet peeve: a major use of this
> will be for virtualization.  I suspect that a lot of people would like
> the majority of KVM guest memory to be unmapped from the host
> pagetables.  But people might also like for guest memory to be
> unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> interface for this.  Getting fd-backed memory into a guest will take
> some possibly major work in the kernel, but getting vma-backed memory
> into a guest without mapping it in the host user address space seems
> much, much worse.
Well, in my view, MAP_EXCLUSIVE is intended for keeping small secrets
rather than for the entire guest memory. I even considered adding a limit
on the mapping size, but then I decided that since RLIMIT_MEMLOCK is
enforced anyway there is no need for a new one.

I agree that getting fd-backed memory into a guest would be less painful
than VMA-backed memory, but KVM can already use memory outside the control
of the kernel via /dev/map [1]. So unless I'm missing something here, there
is no need to use MAP_EXCLUSIVE for the guest memory.

[1] https://lwn.net/Articles/778240/

> > Switching to a chardev doesn't solve the major problem of direct
> > map fragmentation and defeats the ability to use exclusive memory mappings
> > with the existing allocators, while mprotect() and madvise() do not.
> >
>
> Will people really want to do malloc() and then remap it exclusive?
> This sounds dubiously useful at best.

Again, my preference is to have mmap() only, but I see value in this use
case as well. Application developers allocate memory and then sometimes
change its properties rather than mmap() something new. For such usage
mprotect() may be useful.

-- 
Sincerely yours,
Mike.
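
For illustration, the mmap()-only semantics discussed in this message
(pre-populated, locked, pages released on munmap(), bounded by
RLIMIT_MEMLOCK) can be approximated from user space with existing calls,
minus the removal from the direct map. This is only a sketch under stated
assumptions: MAP_EXCLUSIVE exists only in the RFC, so it is passed only when
the headers happen to define it, and secret_alloc()/secret_free() are
hypothetical helper names that are not part of the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* MAP_EXCLUSIVE is the flag proposed in the RFC; mainline headers do not
 * define it. Without it, the pages below stay in the kernel direct map. */
#ifdef MAP_EXCLUSIVE
#define SECRET_EXTRA_FLAGS MAP_EXCLUSIVE
#else
#define SECRET_EXTRA_FLAGS 0
#endif

/* Hypothetical helper: returns a populated, locked, private anonymous
 * region of len bytes, or NULL on failure. */
static void *secret_alloc(size_t len)
{
    struct rlimit rl;

    /* Locked memory is charged against RLIMIT_MEMLOCK, so fail early
     * when the request cannot fit under the soft limit. */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 &&
        rl.rlim_cur != RLIM_INFINITY && len > rl.rlim_cur)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE |
                   SECRET_EXTRA_FLAGS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Lock explicitly so a locking failure is reported to the caller. */
    if (mlock(p, len) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/* Hypothetical helper: scrub the region, then unmap it. Returns the
 * result of munmap(): 0 on success, -1 on error. */
static int secret_free(void *p, size_t len)
{
    memset(p, 0, len);
    return munmap(p, len);
}
```

A caller would store a key in the buffer returned by secret_alloc(4096) and
release it with secret_free() when done. Because the region comes from mmap()
and goes away with munmap(), there is no point at which an existing mapping
changes to exclusive or back.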