From: David Hildenbrand <david@redhat.com>
To: Zi Yan
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt, Thomas Gleixner,
 x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki", Andrew Morton,
 Mike Rapoport, Anshuman Khandual, Michal Hocko, Dan Williams, Wei Yang,
 linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Thu, 6 May 2021 18:28:35 +0200
Organization: Red Hat
In-Reply-To: <16962E62-7D1E-4E06-B832-EC91F54CC359@nvidia.com>
References: <20210506152623.178731-1-zi.yan@sent.com>
 <9D7FD316-988E-4B11-AC1C-64FF790BA79E@nvidia.com>
 <3a51f564-f3d1-c21f-93b5-1b91639523ec@redhat.com>
 <16962E62-7D1E-4E06-B832-EC91F54CC359@nvidia.com>
On 06.05.21 17:50, Zi Yan wrote:
> On 6 May 2021, at 11:40, David Hildenbrand wrote:
>
>>>>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of
>>>>> memory hotplug/hotremove at subsection size, but is not intended to be
>>>>> merged as is. It is there in case one wants to try this out and will be
>>>>> removed during the final submission.
>>>>>
>>>>> Feel free to give suggestions and comments. I am looking forward to
>>>>> your feedback.
>>>>
>>>> Please not like this.
>>>
>>> Do you mind sharing more useful feedback instead of just saying a lot
>>> of no?
>>
>> I remember reasoning about this already in another thread, no? Either
>> you're ignoring my previous feedback or my mind is messing with me.
>
> I definitely remember all your suggestions:
>
> 1. Do not use CMA allocation for 1GB THP.
> 2. Section size defines the minimum size in which we can add_memory(),
>    so we cannot increase it.
>
> I am trying an alternative here. I am not using CMA allocation, and I am
> not increasing the minimum size of add_memory(): by decoupling the memory
> block size from the section size, add_memory() can add a memory block
> smaller than a section (as small as 2MB, the subsection size). In this
> way, the section size can be increased freely. I do not see a strong tie
> between add_memory() and the section size, especially since we have
> subsection bitmap support.

Okay, let me express my thoughts; I could have sworn I explained back then
why I am not a friend of messing with the existing pageblock size:

1. Pageblock size

There are a couple of features that rely on the pageblock size being
reasonably small to work as expected. One example is virtio-balloon free
page reporting, then there is virtio-mem (still also glued to MAX_ORDER),
and we have CMA (still also glued to MAX_ORDER). Most probably there are
more. We track movability / page isolation per pageblock; it's the smallest
granularity at which you can effectively isolate pages or mark them as CMA
(MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP / huge
pages most of our applications use and will use, especially on smallish
systems.

Assume you bump up the pageblock order to 1G. Small VMs won't be able to
report any free pages to the hypervisor. You'll take the "fine-grained" out
of virtio-mem. Each CMA area will have to be at least 1G big, which turns
CMA essentially useless on smallish systems (like we have on arm64 with 64k
base pages -- pageblock_size is 512MB and I hate it).

Then, imagine systems that have like 4G of main memory. By stopping
grouping at 2M and instead grouping at 1G, you can very easily find
yourself in a system where all your 4 pageblocks are unmovable and you
essentially don't optimize for huge pages in that environment any more.

Long story short: we need a different mechanism on top and should leave the
pageblock size untouched; it's too tightly integrated with page isolation,
ordinary THP, and CMA.

2. Section size

I assume the only reason you want to touch that is because pageblock_size
<= section_size, and I guess that's one of the reasons I dislike it so
much. Messing with the section size really only makes sense when we want
to manage metadata for a larger granularity within a section. We allocate
metadata per section. We mark whole sections early/online/present/....
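To keep the granularities in this discussion concrete, here is a rough
back-of-the-envelope sketch. The constants are assumptions based on common
x86-64 defaults with 4 KiB base pages, not values taken from the patch set
under discussion:

```python
# Assumed x86-64 defaults (4 KiB base pages); other architectures differ
# (e.g., arm64 with 64k base pages ends up with 512 MiB pageblocks).
PAGE_SHIFT = 12                       # 4 KiB base page
SECTION_SIZE_BITS = 27                # one memory section: 128 MiB
SUBSECTION_SHIFT = 21                 # one subsection: 2 MiB
pageblock_order = 9                   # pageblock == PMD-sized THP

MiB = 1 << 20
section_size = 1 << SECTION_SIZE_BITS
subsection_size = 1 << SUBSECTION_SHIFT
pageblock_size = (1 << PAGE_SHIFT) << pageblock_order

print(section_size // MiB)        # 128 -> smallest add_memory() unit today
print(subsection_size // MiB)     # 2   -> granularity of the sub-section map
print(pageblock_size // MiB)      # 2   -> granularity of isolation / CMA
print((1 << 30) // section_size)  # 8   -> sections under one 1 GiB page
```

With these defaults, a single 1 GiB gigantic page spans eight whole
sections, which is why grouping at 1 GiB does not fit naturally into
per-section (or per-pageblock) metadata.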
Yes, in the case of vmemmap, we manage the memmap at a smaller granularity
using the sub-section map -- some kind of hack to support some ZONE_DEVICE
cases better.

Let's assume we introduce something new, "gigapage_order", corresponding to
1G. We could either decide to squeeze the metadata into sections, having to
increase the section size, or manage that metadata differently.

Managing it differently certainly makes the necessary changes easier.
Instead of adding more hacks into sections, rather manage that metadata in
a different place / in a different way.

See [1] for an alternative. Not necessarily what I would dream of, but just
to showcase that there might be alternative ways to group pages.

3. Grouping pages > pageblock_order

There are other approaches that would benefit from grouping at >
pageblock_order and having a bigger MAX_ORDER. And that doesn't necessarily
mean forming gigantic pages only; we might want to group at multiple
granularities on a single system. Memory hot(un)plug is one example, but so
is optimizing memory consumption by powering down DIMM banks. Also, some
architectures support differing huge page sizes (aarch64) that could be
improved without CMA. Why not have more than 2 THP sizes on these systems?

Ideally, we'd have a mechanism that tries grouping at different
granularities, like for every order in pageblock_order ...
max_pageblock_order (e.g., 1 GiB), and not only add one new level of
grouping (or increase the single grouping size).

[1] https://lkml.kernel.org/r/20210414023803.937-1-lipeifeng@oppo.com

-- 
Thanks,

David / dhildenb