From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andy Lutomirski
Date: Sat, 9 May 2020 22:05:43 -0700
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()
To: Joerg Roedel
Cc: Andy Lutomirski, Joerg Roedel, X86 ML, "H. Peter Anvin",
 Dave Hansen, Peter Zijlstra, "Rafael J. Wysocki", Arnd Bergmann,
 Andrew Morton, Steven Rostedt, Vlastimil Babka, Michal Hocko, LKML,
 Linux ACPI, linux-arch, Linux-MM
In-Reply-To: <20200509215713.GE18353@8bytes.org>
References: <20200508144043.13893-1-joro@8bytes.org>
 <20200508213609.GU8135@suse.de> <20200509175217.GV8135@suse.de>
 <20200509215713.GE18353@8bytes.org>
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-acpi@vger.kernel.org

On Sat, May 9, 2020 at 2:57 PM Joerg Roedel wrote:
>
> Hi Andy,
>
> On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote:
> > So, unless I'm missing something here, there is an absolute maximum of
> > 512 top-level entries that ever need to be synchronized.
>
> And here is where your assumption is wrong. On 32-bit PAE systems it is
> not the top-level entries that need to be synchronized for vmalloc, but
> the second-level entries. And depending on the kernel configuration,
> there are (in total, not only for vmalloc) 1536, 1024, or 512 of these
> second-level entries.
> How many of them are actually used for vmalloc
> depends on the size of the system RAM (but it is at least 64), because
> the vmalloc area begins after the kernel direct mapping (with an 8MB
> unmapped hole).

I spent some time looking at the code, and I'm guessing you're talking
about the 3-level !SHARED_KERNEL_PMD case. I can't quite figure out
what's going on. Can you explain what is actually going on that causes
different mms/pgds to have top-level entries in the kernel range that
point to different tables? Because I'm not seeing why this makes any
sense.

> > Now, there's an additional complication. On x86_64, we have a rule:
> > those entries that need to be synced start out null and may, during
> > the lifetime of the system, change *once*. They are never unmapped or
> > modified after being allocated. This means that those entries can
> > only ever point to a page *table* and not to a ginormous page. So,
> > even if the hardware were to support ginormous pages (which, IIRC, it
> > doesn't), we would be limited to merely immense and not ginormous
> > pages in the vmalloc range. On x86_32, I don't think we have this
> > rule right now. And this means that it's possible for one of these
> > pages to be unmapped or modified.
>
> The reason for x86-32 being different is that the address space is
> orders of magnitude smaller than on x86-64. We just have 4 top-level
> entries with PAE paging and can't afford to partition kernel address
> space on that level like we do on x86-64. That is the reason the
> address space is partitioned on the second (PMD) level, which is also
> the reason vmalloc synchronization needs to happen on that level. And
> because that's not enough yet, it's also the page-table level used to
> map huge pages.

Why does it need to be partitioned at all? The only thing that comes
to mind is that the LDT range is per-mm.
So I can imagine that the PAE case with a 3G user / 1G kernel split
has to have the vmalloc range and the LDT range in the same top-level
entry. Yuck.

> > So my suggestion is to just apply the x86_64 rule to x86_32 as well.
> > The practical effect will be that 2-level-paging systems will not be
> > able to use huge pages in the vmalloc range, since the rule will be
> > that the vmalloc-relevant entries in the top-level table must point
> > to page *tables* instead of huge pages.
>
> I could very well live with prohibiting huge-page ioremap mappings for
> x86-32. But as I wrote before, this doesn't solve the problems I am
> trying to address with this patch-set, or would only address them if a
> significant amount of total system memory is used.
>
> The pre-allocation solution would work for x86-64; it would only need
> 256kb of preallocated memory for the vmalloc range to never
> synchronize or fault again. I have thought about that and did the math
> before writing this patch-set, but doing the math for 32 bit drove me
> away from it for the reasons written above.

If it's *just* the LDT that's a problem, we could plausibly shrink the
user address range a little bit and put the LDT in the user portion. I
suppose this could end up creating its own set of problems involving
tracking which code owns which page tables.

> And since a lot of the vmalloc_sync_(un)mappings problems I debugged
> were actually related to 32-bit, I want a solution that works for 32-
> and 64-bit x86 (at least until support for x86-32 is removed). And I
> think this patch-set provides a solution that works well for both.

I'm not fundamentally objecting to your patch set, but I do want to
understand what's going on that needs this stuff.
>
>
> 	Joerg