From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song
Date: Sun, 22 Nov 2020 15:29:40 +0800
Subject: Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of
 hugetlb page
In-Reply-To: <55e53264-a07a-a3ec-4253-e72c718b4ee6@oracle.com>
References: <20201120064325.34492-1-songmuchun@bytedance.com>
 <20201120084202.GJ3200@dhcp22.suse.cz>
 <6b1533f7-69c6-6f19-fc93-c69750caaecc@redhat.com>
 <20201120093912.GM3200@dhcp22.suse.cz>
 <55e53264-a07a-a3ec-4253-e72c718b4ee6@oracle.com>
To: Mike Kravetz
Cc: David Hildenbrand, Michal Hocko, Jonathan Corbet, Thomas Gleixner,
 mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
 dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra,
 viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
 mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
 Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com,
 jroedel@suse.de, Mina Almasry, David Rientjes, Matthew Wilcox,
 Oscar Salvador, "Song Bao Hua (Barry Song)", Xiongchun duan,
 linux-doc@vger.kernel.org, LKML, Linux Memory Management List,
 linux-fsdevel

On Sat, Nov 21, 2020 at 1:47 AM Mike Kravetz wrote:
>
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> > On 20.11.20 10:39, Michal Hocko wrote:
> >> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> >>> On 20.11.20 09:42, Michal Hocko wrote:
> >>>> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> >>>> [...]
> >>>>
> >>>> Thanks for improving the cover letter and providing some numbers. I
> >>>> have only glanced through the patchset because I didn't really have
> >>>> more time to dive deeply into them.
> >>>>
> >>>> Overall it looks promising. To summarize: I would prefer not to have
> >>>> the feature enablement controlled by a compile-time option, and the
> >>>> kernel command line option should be opt-in. I also do not like that
> >>>> freeing the pool can trigger the OOM killer, or even shut the system
> >>>> down if no OOM victim is eligible.
> >>>>
> >>>> One thing that I didn't really get to think hard about is the effect
> >>>> of vmemmap manipulation wrt pfn walkers. pfn_to_page can be invalid
> >>>> when racing with the split. How do we enforce that this won't blow
> >>>> up?
> >>>
> >>> I have the same concerns - the sections are online the whole time and
> >>> anybody with pfn_to_online_page() can grab them.
> >>>
> >>> I think we have similar issues with memory offlining when removing
> >>> the vmemmap; it's just very hard to trigger, and we can easily
> >>> protect against it by grabbing the memhotplug lock.
> >>
> >> I am not sure we can/want to extend memory hotplug locking to all pfn
> >> walkers. But you are right that the underlying problem is similar,
> >> just much harder to trigger, because vmemmaps are only removed when
> >> the physical memory is hot-removed, and that happens very seldom.
> >> Maybe it will happen more with virtualization use cases. But this work
> >> makes it even more tricky. If a pfn walker races with a hot-remove, it
> >> would just blow up when accessing the unmapped physical address space.
> >> With this feature, a pfn walker would instead grab a real struct page
> >> that has been re-used for some unpredictable purpose under its feet.
> >> Any failure would be silent and hard to debug.
> >
> > Right, we don't want the memory hotplug locking, thus the discussions
> > regarding RCU. Luckily, so far I have never seen a BUG report about
> > this - maybe because the time between memory offlining
> > (offline_pages()) and the memory/vmemmap getting removed
> > (try_remove_memory()) is just too long. Someone would have to sleep
> > after pfn_to_online_page() for quite a while to trigger it.
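(For concreteness, the racy pattern being described is the common pfn
walker idiom - a minimal sketch, not a quote of any particular in-tree
walker; scan_pfn_range() is a made-up name here, while
pfn_to_online_page(), PageLRU() and page_count() are the usual helpers:)

#include <linux/memory_hotplug.h>
#include <linux/mm.h>

/*
 * Sketch of a typical pfn walker. Nothing in this loop pins the memmap,
 * so between pfn_to_online_page() and the dereferences below, the
 * vmemmap page backing "page" could be freed and re-used. With hot-
 * remove that is an access to an unmapped address; with this feature
 * the reads silently return whatever now lives in that memory.
 */
static void scan_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if (!page)
			continue;

		/* Unsynchronized reads of the (possibly stale) struct page. */
		if (PageLRU(page) || page_count(page) != 1)
			continue;

		/* ... act on the page ... */
	}
}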
> >
> >>
> >> [...]
> >>> To keep things easy, maybe simply never allow freeing these hugetlb
> >>> pages again for now? If they were reserved during boot and the
> >>> vmemmap condensed, then just let them stick around for all eternity.
> >>
> >> Not sure I understand. Do you propose to only free those vmemmap pages
> >> when the pool is initialized during boot time and never allow freeing
> >> them up? That would certainly make it safer and maybe even simpler wrt
> >> implementation.
> >
> > Exactly, let's keep it simple for now. I guess most use cases of this
> > (virtualization, databases, ...) will allocate hugepages during boot
> > and never free them.
>
> Not sure if I agree with that last statement. Database and
> virtualization use cases from my employer allocate hugetlb pages after
> boot. It is shortly after boot, but still not from the boot/kernel
> command line.
>
> Somewhat related, but not exactly addressing this issue ...
>
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled. This would eliminate a
> bunch of the complex code doing page table manipulation. It does not
> address the issue of struct page pages going away, which is being
> discussed here, but it could be a way to simplify the first version of
> this code. If this is going to be an 'opt in' feature as previously
> suggested, then eliminating the PMD/huge page vmemmap mapping may be
> acceptable. My guess is that sysadmins would only 'opt in' if they
> expect most of system memory to be used by hugetlb pages. We certainly
> have database and virtualization use cases where this is true.

Hi Mike,

Yeah, I agree with you that the first version of this feature should be
simple. I can do that (disable PMD/huge page mapping of vmemmap) in the
next version of the patch.
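For concreteness, on x86_64 I imagine it looking something like the
following (only a sketch: hugetlb_free_vmemmap_enabled is a hypothetical
flag that the new opt-in kernel command line option would set, and I
have omitted the altmap corner cases that the real vmemmap_populate()
must handle):

int __meminit vmemmap_populate(unsigned long start, unsigned long end,
			       int node, struct vmem_altmap *altmap)
{
	int err;

	/*
	 * When the feature is opted in, map the vmemmap with base
	 * pages only, so that freeing parts of a hugetlb page's
	 * vmemmap later never has to split a PMD mapping.
	 */
	if (hugetlb_free_vmemmap_enabled)
		err = vmemmap_populate_basepages(start, end, node, altmap);
	else if (boot_cpu_has(X86_FEATURE_PSE))
		err = vmemmap_populate_hugepages(start, end, node, altmap);
	else
		err = vmemmap_populate_basepages(start, end, node, altmap);

	if (!err)
		sync_global_pgds(start, end - 1);
	return err;
}

That way all of the PMD splitting code in this series could be dropped,
and remapping a hugetlb page's vmemmap would only ever touch PTEs.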
But I have another question: what is the problem when the struct page
pages go away? I have not understood the issue being discussed here;
I hope you can explain it to me. Thanks.

> --
> Mike Kravetz

--
Yours,
Muchun