From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-3.5 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0E33EC433E6 for ; Tue, 2 Feb 2021 09:35:13 +0000 (UTC)
Received: from ml01.01.org (ml01.01.org [198.145.21.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 6B85464F09 for ; Tue, 2 Feb 2021 09:35:12 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6B85464F09
Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=suse.com
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-nvdimm-bounces@lists.01.org
Received: from ml01.vlan13.01.org (localhost [IPv6:::1]) by ml01.01.org (Postfix) with ESMTP id C8BB9100EB856; Tue, 2 Feb 2021 01:35:11 -0800 (PST)
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=195.135.220.15; helo=mx2.suse.de; envelope-from=mhocko@suse.com; receiver=
Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 96BF7100EBBDE for ; Tue, 2 Feb 2021 01:35:08 -0800 (PST)
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1612258506; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Q54UyAniLH3WTtqS2/eEFFAOsABqgVttGqq1oJHHl3o=; b=pwRKjSfmAAUl4rpfKIVMY6cCCIH0A1qOnaV8W5VBLKM6omzrGDOVmq/p9akJdMt/mFgkO8 V2tc8GFVYU80SWASGLlyhGF0EaS9YLNXv/f0XhQU0L4onN+4ztKCsUfGmCzaoW6V54AWau ShSpuMgXIb5koikt2IimucJXrn9ZlD0=
Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 9998AB171; Tue, 2 Feb 2021 09:35:06 +0000 (UTC)
Date: Tue, 2 Feb 2021 10:35:05 +0100
From: Michal Hocko
To: James Bottomley
Subject: Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation
Message-ID:
References: <20210121122723.3446-1-rppt@kernel.org> <20210121122723.3446-8-rppt@kernel.org> <20210126114657.GL827@dhcp22.suse.cz> <303f348d-e494-e386-d1f5-14505b5da254@redhat.com> <20210126120823.GM827@dhcp22.suse.cz> <20210128092259.GB242749@kernel.org> <73738cda43236b5ac2714e228af362b67a712f5d.camel@linux.ibm.com> <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com>
Message-ID-Hash: VOWVA2ERGKMYZHX2NNVGFMIY7L3JQ6OJ
X-Message-ID-Hash: VOWVA2ERGKMYZHX2NNVGFMIY7L3JQ6OJ
X-MailFrom: mhocko@suse.com
X-Mailman-Rule-Hits: nonmember-moderation
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation
CC: Mike Rapoport , David Hildenbrand , Andrew Morton , Alexander Viro , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Ingo Molnar , "Kirill A.
 Shutemov" , Matthew Wilcox , Mark Rutland , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Rick Edgecombe , Roman Gushchin , Shakeel Butt , Shuah Khan , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org, Hagen Paul Pfeifer , Palmer Dabbelt
X-Mailman-Version: 3.1.1
Precedence: list
List-Id: "Linux-nvdimm developer list."
Archived-At:
List-Archive:
List-Help:
List-Post:
List-Subscribe:
List-Unsubscribe:
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon 01-02-21 08:56:19, James Bottomley wrote:
> On Fri, 2021-01-29 at 09:23 +0100, Michal Hocko wrote:
> > On Thu 28-01-21 13:05:02, James Bottomley wrote:
> > > Obviously the API choice could be revisited
> > > but do you have anything to add over the previous discussion, or is
> > > this just to get your access control?
> >
> > Well, access control is certainly one thing which I still believe is
> > missing. But if there is a general agreement that the direct map
> > manipulation is not that critical then this will become much less of
> > a problem of course.
>
> The secret memory is a scarce resource but it's not a facility that
> should only be available to some users.

How do those two objectives go together? Or maybe we differ in our
understanding of what scarce really means here. If the pool of secret
memory is very limited then you really need a way to stop one party
from depriving others. More on that below.

> > It all boils down whether secret memory is a scarce resource. With
> > the existing implementation it really is. It is effectively
> > repeating same design errors as hugetlb did. And look now, we have a
> > subtle and convoluted reservation code to track mmap requests and we
> > have a cgroup controller to, guess what, have at least some control
> > over distribution of the preallocated pool. See where am I coming
> > from?
>
> I'm fairly sure rlimit is the correct way to control this. The
> subtlety in both rlimit and memcg tracking comes from deciding to
> account under an existing category rather than having our own new one.
> People don't like new stuff in accounting because it requires
> modifications to everything in userspace. Accounting under an
> existing limit keeps userspace the same but leads to endless arguments
> about which limit it should be under. It took us several patch set
> iterations to get to a fragile consensus on this which you're now
> disrupting for reasons you're not making clear.

I hoped I had made my points really clear. The existing scheme allows
one user (potentially an adversary) to deplete the preallocated pool
and cause a shitstorm of OOM killer invocations, because there is no
real way to replenish the pool from the OOM killer other than to keep
killing tasks at random until one happens to release its secret memory
back to the pool. Is that clearer now?

And no, rlimit and the memcg limit will not save you from that, because
the former is per process and the latter is hard to manage under a
single limit which might be an order of magnitude larger than the
secret memory pool size. See the point?

I have also proposed potential ways out of this. Either the pool is not
fixed in size and you make it regular unevictable memory (if direct map
fragmentation is not considered a major problem), or you need careful
access control, or you need SIGBUS on the mmap failure (to allow at
least some fallback mode for the caller). I do not see any other way
around it. I might be missing some other ways, but so far I keep
hearing that the existing scheme is just fine because this has been
discussed in the past and you have agreed it is OK. Without any
specifics...

Please keep in mind that this is a user interface and it deserves
careful scrutiny. So rather than pushing back with "you are disrupting
a consensus" kind of feedback, please try to stay technical.

> > If the secret memory is more in line with mlock without any imposed
> > limit (other than available memory) in the end then, sure, using the
> > same access control as mlock sounds reasonable. Btw. if this is
> > really just a more restrictive mlock then is there any reason to not
> > hook this into the existing mlock infrastructure (e.g.
> > MCL_EXCLUSIVE)? Implications would be that direct map would be
> > handled on instantiation/tear down paths, migration would deal with
> > the same (if possible). Other than that it would be mlock like.
>
> In the very first patch set we proposed a mmap flag to do this. Under
> detailed probing it emerged that this suffers from several design
> problems: the KVM people want VMM to be able to remove the secret
> memory range from the process; there may be situations where sharing is
> useful and some people want to be able to seal the operations. All of
> this ended up convincing everyone that a file descriptor based approach
> was better than a mmap one.

OK, fair enough. This belongs in the changelog IMHO. It is good to know
why existing interfaces do not match the need.
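
The pool depletion argument above can be illustrated with a small toy
model. This is only a sketch and not kernel code: the SecretPool class,
the deplete() helper, the 64 MiB pool size and the 8 MiB per-process
limit are all hypothetical names and numbers chosen for illustration.
It assumes a fixed-size preallocated pool plus a limit that is enforced
per process only, and shows that one user who spawns enough processes
can drain the pool while every single process stays within its limit.

```python
from dataclasses import dataclass, field


@dataclass
class SecretPool:
    """Toy model of a fixed-size, preallocated secret memory pool."""
    capacity: int                 # total pool size in bytes (hypothetical)
    per_process_limit: int        # stands in for a per-process rlimit
    used: dict = field(default_factory=dict)  # pid -> bytes currently held

    def free_bytes(self) -> int:
        return self.capacity - sum(self.used.values())

    def allocate(self, pid: int, nbytes: int) -> bool:
        """Return True on success, False if the limit or the pool refuses."""
        held = self.used.get(pid, 0)
        if held + nbytes > self.per_process_limit:
            return False          # the per-process limit was hit
        if nbytes > self.free_bytes():
            return False          # the shared pool is exhausted
        self.used[pid] = held + nbytes
        return True


def deplete(pool: SecretPool, first_pid: int) -> int:
    """One user spawns processes that each stay within their own limit.

    Returns the number of processes needed to drain the pool.
    """
    pid = first_pid
    while pool.free_bytes() > 0:
        chunk = min(pool.per_process_limit, pool.free_bytes())
        assert pool.allocate(pid, chunk)
        pid += 1
    return pid - first_pid


MiB = 1 << 20
pool = SecretPool(capacity=64 * MiB, per_process_limit=8 * MiB)
attackers = deplete(pool, first_pid=1000)       # no process exceeds 8 MiB
victim_ok = pool.allocate(pid=1, nbytes=4096)   # unrelated user, one page
print(attackers, victim_ok)
```

With these made-up numbers, eight processes are enough to take the whole
pool, and the unrelated request for a single page then fails even though
no per-process limit was ever exceeded. A limit on the aggregate (or
access control over who may allocate at all) is what the model lacks.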
-- Michal Hocko SUSE Labs _______________________________________________ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-leave@lists.01.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE389C433E0 for ; Tue, 2 Feb 2021 09:38:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6F37E64D99 for ; Tue, 2 Feb 2021 09:38:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233020AbhBBJh5 (ORCPT ); Tue, 2 Feb 2021 04:37:57 -0500 Received: from mx2.suse.de ([195.135.220.15]:56880 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233019AbhBBJfy (ORCPT ); Tue, 2 Feb 2021 04:35:54 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1612258506; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Q54UyAniLH3WTtqS2/eEFFAOsABqgVttGqq1oJHHl3o=; b=pwRKjSfmAAUl4rpfKIVMY6cCCIH0A1qOnaV8W5VBLKM6omzrGDOVmq/p9akJdMt/mFgkO8 V2tc8GFVYU80SWASGLlyhGF0EaS9YLNXv/f0XhQU0L4onN+4ztKCsUfGmCzaoW6V54AWau ShSpuMgXIb5koikt2IimucJXrn9ZlD0= Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 9998AB171; Tue, 2 Feb 2021 09:35:06 +0000 (UTC) Date: Tue, 2 Feb 2021 10:35:05 +0100 From: Michal Hocko To: James Bottomley Cc: Mike Rapoport , 
David Hildenbrand , Andrew Morton , Alexander Viro , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Ingo Molnar , "Kirill A. Shutemov" , Matthew Wilcox , Mark Rutland , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Rick Edgecombe , Roman Gushchin , Shakeel Butt , Shuah Khan , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org, Hagen Paul Pfeifer , Palmer Dabbelt Subject: Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation Message-ID: References: <20210121122723.3446-1-rppt@kernel.org> <20210121122723.3446-8-rppt@kernel.org> <20210126114657.GL827@dhcp22.suse.cz> <303f348d-e494-e386-d1f5-14505b5da254@redhat.com> <20210126120823.GM827@dhcp22.suse.cz> <20210128092259.GB242749@kernel.org> <73738cda43236b5ac2714e228af362b67a712f5d.camel@linux.ibm.com> <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 01-02-21 08:56:19, James Bottomley wrote: > On Fri, 2021-01-29 at 09:23 +0100, Michal Hocko wrote: > > On Thu 28-01-21 13:05:02, James Bottomley wrote: > > > Obviously the API choice could be revisited > > > but do you have anything to add over the previous discussion, or is > > > this just to get your access control? > > > > Well, access control is certainly one thing which I still believe is > > missing. 
But if there is a general agreement that the direct map > > manipulation is not that critical then this will become much less of > > a problem of course. > > The secret memory is a scarce resource but it's not a facility that > should only be available to some users. How those two objectives go along? Or maybe our understanding of what scrace really means here. If the pool of the secret memory is very limited then you really need a way to stop one party from depriving others. More on that below. > > It all boils down whether secret memory is a scarce resource. With > > the existing implementation it really is. It is effectivelly > > repeating same design errors as hugetlb did. And look now, we have a > > subtle and convoluted reservation code to track mmap requests and we > > have a cgroup controller to, guess what, have at least some control > > over distribution if the preallocated pool. See where am I coming > > from? > > I'm fairly sure rlimit is the correct way to control this. The > subtlety in both rlimit and memcg tracking comes from deciding to > account under an existing category rather than having our own new one. > People don't like new stuff in accounting because it requires > modifications to everything in userspace. Accounting under and > existing limit keeps userspace the same but leads to endless arguments > about which limit it should be under. It took us several patch set > iterations to get to a fragile consensus on this which you're now > disrupting for reasons you're not making clear. I hoped I had made my points really clear. The existing scheme allows one users (potentially adversary) to deplete the preallocated pool and cause a shitstorm of OOM killer because there is no real way to replenish the pool from the oom killer other than randomly keep killing tasks until one happens to release its secret memory back to the pool. Is that more clear now? 
And no, rlimit and memcg limit will not save you from that because the former is per process and later is hard to manage under a single limit which might be order of magnitude larger than the secret memory pool size. See the point? I have also proposed potential ways out of this. Either the pool is not fixed sized and you make it a regular unevictable memory (if direct map fragmentation is not considered a major problem) or you need a careful access control or you need SIGBUS on the mmap failure (to allow at least some fallback mode to caller). I do not see any other way around it. I might be missing some other ways but so far I keep hearing that the existing scheme is just fine because this has been discussed in the past and you have agreed it is ok. Without any specifics... Please keep in mind this is a user interface and it is due to careful scrutiny. So rather than pushing back with "you are disrupting a consensus" kinda feedback, please try to stay technical. > > If the secret memory is more in line with mlock without any imposed > > limit (other than available memory) in the end then, sure, using the > > same access control as mlock sounds reasonable. Btw. if this is > > really just a more restrictive mlock then is there any reason to not > > hook this into the existing mlock infrastructure (e.g. > > MCL_EXCLUSIVE)? Implications would be that direct map would be > > handled on instantiation/tear down paths, migration would deal with > > the same (if possible). Other than that it would be mlock like. > > In the very first patch set we proposed a mmap flag to do this. Under > detailed probing it emerged that this suffers from several design > problems: the KVM people want VMM to be able to remove the secret > memory range from the process; there may be situations where sharing is > useful and some people want to be able to seal the operations. All of > this ended up convincing everyone that a file descriptor based approach > was better than a mmap one. 
OK, fair enough. This belongs to the changelog IMHO. It is good to know why existing interfaces do not match the need. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2A674C433DB for ; Tue, 2 Feb 2021 09:35:23 +0000 (UTC) Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id C7C6164F09 for ; Tue, 2 Feb 2021 09:35:22 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C7C6164F09 Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=suse.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=merlin.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:Cc:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References:Message-ID: Subject:To:From:Date:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=9BzyiX46b1BlFbOph5jYI8dUMXCLe5gO1VtN74Nna9M=; b=T1iY699OQ1kmlaQSUmizCpD9A eYEBeoFgLM17nN0UAF1Is6IoLjrDSZYaiV57+VLEYUZf16/A+rtEp/7OKarJUxhaxWu7Yc75vKeuo p+GMzbEAhLU65P4NE5Rjz0TCXZkHJxIwVFY0OoQz4vSJn9RdKuvK7lpenrDNoWbYJyeDxNFtDp6fo 
Uk3iRvPEXohRMt0jKvpn5nPTKrM/hCNvgA9d+YierW1zJshhJol/WD//MDyL6PR444/s392OjFOVa ydgK8zWpFvAPhMdhMfIi+Wfw/y0xiXt1GlWGuUkW4PhmKoLOSZBdVmjc1q1EptXngirCir/0aw7ok PGeG5irWg==; Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.92.3 #3 (Red Hat Linux)) id 1l6s5a-0001xu-6K; Tue, 02 Feb 2021 09:35:14 +0000 Received: from mx2.suse.de ([195.135.220.15]) by merlin.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1l6s5V-0001vr-Ar; Tue, 02 Feb 2021 09:35:11 +0000 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1612258506; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Q54UyAniLH3WTtqS2/eEFFAOsABqgVttGqq1oJHHl3o=; b=pwRKjSfmAAUl4rpfKIVMY6cCCIH0A1qOnaV8W5VBLKM6omzrGDOVmq/p9akJdMt/mFgkO8 V2tc8GFVYU80SWASGLlyhGF0EaS9YLNXv/f0XhQU0L4onN+4ztKCsUfGmCzaoW6V54AWau ShSpuMgXIb5koikt2IimucJXrn9ZlD0= Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 9998AB171; Tue, 2 Feb 2021 09:35:06 +0000 (UTC) Date: Tue, 2 Feb 2021 10:35:05 +0100 From: Michal Hocko To: James Bottomley Subject: Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation Message-ID: References: <20210121122723.3446-1-rppt@kernel.org> <20210121122723.3446-8-rppt@kernel.org> <20210126114657.GL827@dhcp22.suse.cz> <303f348d-e494-e386-d1f5-14505b5da254@redhat.com> <20210126120823.GM827@dhcp22.suse.cz> <20210128092259.GB242749@kernel.org> <73738cda43236b5ac2714e228af362b67a712f5d.camel@linux.ibm.com> <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com> X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: 
sfid-20210202_043509_651058_3BE6358F X-CRM114-Status: GOOD ( 39.56 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mark Rutland , David Hildenbrand , Peter Zijlstra , Catalin Marinas , Dave Hansen , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, "H. Peter Anvin" , Christopher Lameter , Shuah Khan , Thomas Gleixner , Elena Reshetova , linux-arch@vger.kernel.org, Tycho Andersen , linux-nvdimm@lists.01.org, Will Deacon , x86@kernel.org, Matthew Wilcox , Mike Rapoport , Ingo Molnar , Michael Kerrisk , Palmer Dabbelt , Arnd Bergmann , Hagen Paul Pfeifer , Borislav Petkov , Alexander Viro , Andy Lutomirski , Paul Walmsley , "Kirill A. Shutemov" , Dan Williams , linux-arm-kernel@lists.infradead.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org, Palmer Dabbelt , linux-fsdevel@vger.kernel.org, Shakeel Butt , Andrew Morton , Rick Edgecombe , Roman Gushchin , Mike Rapoport Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org On Mon 01-02-21 08:56:19, James Bottomley wrote: > On Fri, 2021-01-29 at 09:23 +0100, Michal Hocko wrote: > > On Thu 28-01-21 13:05:02, James Bottomley wrote: > > > Obviously the API choice could be revisited > > > but do you have anything to add over the previous discussion, or is > > > this just to get your access control? > > > > Well, access control is certainly one thing which I still believe is > > missing. But if there is a general agreement that the direct map > > manipulation is not that critical then this will become much less of > > a problem of course. > > The secret memory is a scarce resource but it's not a facility that > should only be available to some users. How those two objectives go along? 
Or maybe our understanding of what scrace really means here. If the pool of the secret memory is very limited then you really need a way to stop one party from depriving others. More on that below. > > It all boils down whether secret memory is a scarce resource. With > > the existing implementation it really is. It is effectivelly > > repeating same design errors as hugetlb did. And look now, we have a > > subtle and convoluted reservation code to track mmap requests and we > > have a cgroup controller to, guess what, have at least some control > > over distribution if the preallocated pool. See where am I coming > > from? > > I'm fairly sure rlimit is the correct way to control this. The > subtlety in both rlimit and memcg tracking comes from deciding to > account under an existing category rather than having our own new one. > People don't like new stuff in accounting because it requires > modifications to everything in userspace. Accounting under and > existing limit keeps userspace the same but leads to endless arguments > about which limit it should be under. It took us several patch set > iterations to get to a fragile consensus on this which you're now > disrupting for reasons you're not making clear. I hoped I had made my points really clear. The existing scheme allows one users (potentially adversary) to deplete the preallocated pool and cause a shitstorm of OOM killer because there is no real way to replenish the pool from the oom killer other than randomly keep killing tasks until one happens to release its secret memory back to the pool. Is that more clear now? And no, rlimit and memcg limit will not save you from that because the former is per process and later is hard to manage under a single limit which might be order of magnitude larger than the secret memory pool size. See the point? I have also proposed potential ways out of this. 
Either the pool is not fixed sized and you make it a regular unevictable memory (if direct map fragmentation is not considered a major problem) or you need a careful access control or you need SIGBUS on the mmap failure (to allow at least some fallback mode to caller). I do not see any other way around it. I might be missing some other ways but so far I keep hearing that the existing scheme is just fine because this has been discussed in the past and you have agreed it is ok. Without any specifics... Please keep in mind this is a user interface and it is due to careful scrutiny. So rather than pushing back with "you are disrupting a consensus" kinda feedback, please try to stay technical. > > If the secret memory is more in line with mlock without any imposed > > limit (other than available memory) in the end then, sure, using the > > same access control as mlock sounds reasonable. Btw. if this is > > really just a more restrictive mlock then is there any reason to not > > hook this into the existing mlock infrastructure (e.g. > > MCL_EXCLUSIVE)? Implications would be that direct map would be > > handled on instantiation/tear down paths, migration would deal with > > the same (if possible). Other than that it would be mlock like. > > In the very first patch set we proposed a mmap flag to do this. Under > detailed probing it emerged that this suffers from several design > problems: the KVM people want VMM to be able to remove the secret > memory range from the process; there may be situations where sharing is > useful and some people want to be able to seal the operations. All of > this ended up convincing everyone that a file descriptor based approach > was better than a mmap one. OK, fair enough. This belongs to the changelog IMHO. It is good to know why existing interfaces do not match the need. 
-- Michal Hocko SUSE Labs _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7A3ADC433DB for ; Tue, 2 Feb 2021 09:36:29 +0000 (UTC) Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 0926664E9B for ; Tue, 2 Feb 2021 09:36:28 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0926664E9B Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=suse.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=merlin.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:Cc:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References:Message-ID: Subject:To:From:Date:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=/+WLqMY/4fsc4ZMhWjDNdNDGUVUb+u722Za2vOHuHlA=; b=JzvBVgOjrvqVluJokoipQ1tdj LE8RXALbDGp0t6SENiQQ5lA66GwNpr44UZyQckH2wE3a90gDJXTlPZGC9mi7EMrpCoWDNLqNw11li 
wyDVeacNw6CuVgVR3MTkxn4/vRKfLlJrmBrg/SBxzyqpI9otGgilbZyzZSdSDRpl3/sZMnm4Skan2 nGeXjLk/ZFxuFpDHgF9Y6gdeIprV35bTIFhei1+BD65VSXpe4VxhwWwqFHzOtlbWRTHGyaDJYVrpg +jAIdl2YSTzPIXrFHEqU+AfaZ1909OyPeKwYNb4WKnXKun9nJVmpFfdy5kTBgBxtsm3bJOYbxv3pU M/TOU59iA==; Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.92.3 #3 (Red Hat Linux)) id 1l6s5Z-0001xU-3u; Tue, 02 Feb 2021 09:35:13 +0000 Received: from mx2.suse.de ([195.135.220.15]) by merlin.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1l6s5V-0001vr-Ar; Tue, 02 Feb 2021 09:35:11 +0000 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1612258506; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Q54UyAniLH3WTtqS2/eEFFAOsABqgVttGqq1oJHHl3o=; b=pwRKjSfmAAUl4rpfKIVMY6cCCIH0A1qOnaV8W5VBLKM6omzrGDOVmq/p9akJdMt/mFgkO8 V2tc8GFVYU80SWASGLlyhGF0EaS9YLNXv/f0XhQU0L4onN+4ztKCsUfGmCzaoW6V54AWau ShSpuMgXIb5koikt2IimucJXrn9ZlD0= Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 9998AB171; Tue, 2 Feb 2021 09:35:06 +0000 (UTC) Date: Tue, 2 Feb 2021 10:35:05 +0100 From: Michal Hocko To: James Bottomley Subject: Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation Message-ID: References: <20210121122723.3446-1-rppt@kernel.org> <20210121122723.3446-8-rppt@kernel.org> <20210126114657.GL827@dhcp22.suse.cz> <303f348d-e494-e386-d1f5-14505b5da254@redhat.com> <20210126120823.GM827@dhcp22.suse.cz> <20210128092259.GB242749@kernel.org> <73738cda43236b5ac2714e228af362b67a712f5d.camel@linux.ibm.com> <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <6de6b9f9c2d28eecc494e7db6ffbedc262317e11.camel@linux.ibm.com> X-CRM114-Version: 
20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20210202_043509_651058_3BE6358F X-CRM114-Status: GOOD ( 39.56 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mark Rutland , David Hildenbrand , Peter Zijlstra , Catalin Marinas , Dave Hansen , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, "H. Peter Anvin" , Christopher Lameter , Shuah Khan , Thomas Gleixner , Elena Reshetova , linux-arch@vger.kernel.org, Tycho Andersen , linux-nvdimm@lists.01.org, Will Deacon , x86@kernel.org, Matthew Wilcox , Mike Rapoport , Ingo Molnar , Michael Kerrisk , Palmer Dabbelt , Arnd Bergmann , Hagen Paul Pfeifer , Borislav Petkov , Alexander Viro , Andy Lutomirski , Paul Walmsley , "Kirill A. Shutemov" , Dan Williams , linux-arm-kernel@lists.infradead.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org, Palmer Dabbelt , linux-fsdevel@vger.kernel.org, Shakeel Butt , Andrew Morton , Rick Edgecombe , Roman Gushchin , Mike Rapoport Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Mon 01-02-21 08:56:19, James Bottomley wrote: > On Fri, 2021-01-29 at 09:23 +0100, Michal Hocko wrote: > > On Thu 28-01-21 13:05:02, James Bottomley wrote: > > > Obviously the API choice could be revisited > > > but do you have anything to add over the previous discussion, or is > > > this just to get your access control? > > > > Well, access control is certainly one thing which I still believe is > > missing. But if there is a general agreement that the direct map > > manipulation is not that critical then this will become much less of > > a problem of course. 
> > The secret memory is a scarce resource but it's not a facility that > should only be available to some users. How those two objectives go along? Or maybe our understanding of what scrace really means here. If the pool of the secret memory is very limited then you really need a way to stop one party from depriving others. More on that below. > > It all boils down whether secret memory is a scarce resource. With > > the existing implementation it really is. It is effectivelly > > repeating same design errors as hugetlb did. And look now, we have a > > subtle and convoluted reservation code to track mmap requests and we > > have a cgroup controller to, guess what, have at least some control > > over distribution if the preallocated pool. See where am I coming > > from? > > I'm fairly sure rlimit is the correct way to control this. The > subtlety in both rlimit and memcg tracking comes from deciding to > account under an existing category rather than having our own new one. > People don't like new stuff in accounting because it requires > modifications to everything in userspace. Accounting under and > existing limit keeps userspace the same but leads to endless arguments > about which limit it should be under. It took us several patch set > iterations to get to a fragile consensus on this which you're now > disrupting for reasons you're not making clear. I hoped I had made my points really clear. The existing scheme allows one users (potentially adversary) to deplete the preallocated pool and cause a shitstorm of OOM killer because there is no real way to replenish the pool from the oom killer other than randomly keep killing tasks until one happens to release its secret memory back to the pool. Is that more clear now? And no, rlimit and memcg limit will not save you from that because the former is per process and later is hard to manage under a single limit which might be order of magnitude larger than the secret memory pool size. See the point? 

I have also proposed potential ways out of this. Either the pool is not
of a fixed size and you make it regular unevictable memory (if direct
map fragmentation is not considered a major problem), or you need
careful access control, or you need SIGBUS on the mmap failure (to allow
at least some fallback mode for the caller). I do not see any other way
around it. I might be missing some other ways, but so far I keep hearing
that the existing scheme is just fine because this has been discussed in
the past and you have agreed it is ok. Without any specifics...

Please keep in mind this is a user interface and as such it is due
careful scrutiny. So rather than pushing back with "you are disrupting a
consensus" kind of feedback, please try to stay technical.

> > If the secret memory is more in line with mlock without any imposed
> > limit (other than available memory) in the end then, sure, using the
> > same access control as mlock sounds reasonable. Btw. if this is
> > really just a more restrictive mlock then is there any reason to not
> > hook this into the existing mlock infrastructure (e.g.
> > MCL_EXCLUSIVE)? Implications would be that direct map would be
> > handled on instantiation/tear down paths, migration would deal with
> > the same (if possible). Other than that it would be mlock like.
>
> In the very first patch set we proposed a mmap flag to do this. Under
> detailed probing it emerged that this suffers from several design
> problems: the KVM people want VMM to be able to remove the secret
> memory range from the process; there may be situations where sharing is
> useful and some people want to be able to seal the operations. All of
> this ended up convincing everyone that a file descriptor based approach
> was better than a mmap one.

OK, fair enough. This belongs in the changelog IMHO. It is good to know
why the existing interfaces do not match the need.
-- 
Michal Hocko
SUSE Labs

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel