Date: Mon, 23 Nov 2020 12:32:08 +0100
From: Michal Hocko
To: Muchun Song
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra,
	viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
	mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
	Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com,
	jroedel@suse.de, Mina Almasry, David Rientjes, Matthew Wilcox,
	Oscar Salvador, "Song Bao Hua (Barry Song)", Xiongchun duan,
	linux-doc@vger.kernel.org, LKML, Linux Memory Management List,
	linux-fsdevel
Subject: Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page
Message-ID: <20201123113208.GL27488@dhcp22.suse.cz>
References: <20201120084202.GJ3200@dhcp22.suse.cz>
	<20201120131129.GO3200@dhcp22.suse.cz>
	<20201123074046.GB27488@dhcp22.suse.cz>
	<20201123094344.GG27488@dhcp22.suse.cz>
	<20201123104258.GJ27488@dhcp22.suse.cz>

On Mon 23-11-20 19:16:18, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko wrote:
> >
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko wrote:
> > > >
> > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko wrote:
> > > > > >
> > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko wrote:
> > > > > > > >
> > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > > > > > > > only glanced through the patchset because I didn't really have more time
> > > > > > > > > > to dive deeply into them.
> > > > > > > > > >
> > > > > > > > > > Overall it looks promising. To summarize. I would prefer to not have
> > > > > > > > > > the feature enablement controlled by compile time option and the kernel
> > > > > > > > > > command line option should be opt-in. I also do not like that freeing
> > > > > > > > > > the pool can trigger the oom killer or even shut the system down if no
> > > > > > > > > > oom victim is eligible.
> > > > > > > > > >
> > > > > > > > > Hi Michal,
> > > > > > > > >
> > > > > > > > > I have replied to you about those questions on the other mail thread.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > One thing that I didn't really get to think hard about is what is the
> > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > > > > > > blow up?
> > > > > > > > > >
> > > > > > > > > This feature depends on the CONFIG_SPARSEMEM_VMEMMAP,
> > > > > > > > > in this case, the pfn_to_page can work. The return value of the
> > > > > > > > > pfn_to_page is actually the address of its struct page.
> > > > > > > > > I can not figure out where the problem is. Can you describe the
> > > > > > > > > problem in detail please? Thanks.
> > > > > > > >
> > > > > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > > > > returned because vmemmap could get freed up and the respective memory
> > > > > > > > released to the page allocator and reused for something else. See?
> > > > > > >
> > > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > >
> > > > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > > > (they describe) is hotremoved via hotplug.
> > > > > >
> > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > (cannot access the struct page)?
> > > > > >
> > > > > > But I do not follow how that relates to my concern above.
> > > > >
> > > > > Sorry. I shouldn't understand your concerns.
> > > > >
> > > > > vmemmap pages                 page frame
> > > > > +-----------+   mapping to   +-----------+
> > > > > |           | -------------> |     0     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     1     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     2     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     3     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     4     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     5     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     6     |
> > > > > +-----------+                +-----------+
> > > > > |           | -------------> |     7     |
> > > > > +-----------+                +-----------+
> > > > >
> > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > value when the pfn is the page frame 2-7? Thanks.
> > > >
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > >
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> >
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
>
> Yeah, you are right. But it may not be a problem for HugeTLB pages.
> Because in most cases, we only read the tail struct page and get the
> head struct page through compound_head() when the pfn is within
> a HugeTLB range. Right?

Many pfn walkers would encounter the head page first and then skip over
the rest. Those should be reasonably safe. But there is no guarantee, and
the fact that you need a valid page->compound_head, which might get
scribbled over once you have the struct page, makes this extremely
subtle.
--
SUSE Labs
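
For illustration only, the difference between the two kinds of walkers
discussed above can be modelled in a few lines of userspace C. This is a
sketch, not kernel code: struct model_page, memmap[], NR_PAGES and the
model_* helpers are hypothetical names made up for this example. The one
thing assumed from the kernel is the convention that a tail page stores
the address of its head page in compound_head with bit 0 set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct page. */
struct model_page {
	uintptr_t compound_head;	/* tail pages: address of head | 1 */
	unsigned int order;		/* head pages: log2 of the page count */
	int is_head;			/* non-zero for the head page */
};

#define NR_PAGES 16

/* Stand-in for the vmemmap: one model_page per pfn. */
static struct model_page memmap[NR_PAGES];

/* Resolve a tail page to its head; any other page resolves to itself. */
static struct model_page *model_compound_head(struct model_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct model_page *)(head - 1);
	return page;
}

/* Turn pfn .. pfn + 2^order - 1 into one compound page. */
static void model_prep_compound(size_t pfn, unsigned int order)
{
	size_t nr = (size_t)1 << order;

	assert(pfn + nr <= NR_PAGES);
	memmap[pfn].is_head = 1;
	memmap[pfn].order = order;
	for (size_t i = 1; i < nr; i++)
		memmap[pfn + i].compound_head = (uintptr_t)&memmap[pfn] | 1;
}

/*
 * Walk [start, end) and return how many model_page entries were read.
 * *tail_lookups counts the reads that had to trust a tail page's
 * compound_head field, i.e. the reads that would go wrong if that
 * memory had been reused in the meantime.
 */
static size_t model_walk(size_t start, size_t end, size_t *tail_lookups)
{
	size_t touched = 0;
	size_t pfn = start;

	*tail_lookups = 0;
	while (pfn < end) {
		struct model_page *page = &memmap[pfn];
		struct model_page *head = model_compound_head(page);

		touched++;
		if (head != page)
			(*tail_lookups)++;
		if (head->is_head) {
			/* Skip to the first pfn past the compound page. */
			pfn = (size_t)(head - memmap) + ((size_t)1 << head->order);
			continue;
		}
		pfn++;
	}
	return touched;
}
```

With model_prep_compound(4, 3), a walk over [0, NR_PAGES) reads 9 entries
and never touches a tail. A walk that starts inside the range, at pfn 6,
reads 5 entries, and the first of them is a tail whose compound_head has
to be intact for the walk to find the head at all. That second case is
the one that gets subtle once tail struct pages can be remapped or freed.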