From: David Hildenbrand <david@redhat.com>
To: Zi Yan
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt, Thomas Gleixner,
 x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki", Andrew Morton,
 Mike Rapoport, Anshuman Khandual, Michal Hocko, Dan Williams, Wei Yang,
 linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Thu, 6 May 2021 18:28:35 +0200
Organization: Red Hat
In-Reply-To: <16962E62-7D1E-4E06-B832-EC91F54CC359@nvidia.com>
References: <20210506152623.178731-1-zi.yan@sent.com>
 <9D7FD316-988E-4B11-AC1C-64FF790BA79E@nvidia.com>
 <3a51f564-f3d1-c21f-93b5-1b91639523ec@redhat.com>
 <16962E62-7D1E-4E06-B832-EC91F54CC359@nvidia.com>
On 06.05.21 17:50, Zi Yan wrote:
> On 6 May 2021, at 11:40, David Hildenbrand wrote:
>
>>>>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of
>>>>> memory hotplug/hotremove at subsection size, but is not intended to be
>>>>> merged as is. It is there in case one wants to try this out and will be
>>>>> removed during the final submission.
>>>>>
>>>>> Feel free to give suggestions and comments. I am looking forward to
>>>>> your feedback.
>>>>
>>>> Please not like this.
>>>
>>> Do you mind sharing more useful feedback instead of just saying a lot
>>> of no?
>>
>> I remember reasoning about this already in another thread, no? Either
>> you're ignoring my previous feedback or my mind is messing with me.
>
> I definitely remember all your suggestions:
>
> 1. Do not use CMA allocation for 1GB THP.
> 2. Section size defines the minimum size in which we can add_memory(),
>    so we cannot increase it.
>
> I am trying an alternative here. I am not using CMA allocation, and I am
> not increasing the minimum size of add_memory(): by decoupling the memory
> block size from the section size, add_memory() can add a memory block
> smaller than a section (as small as 2MB, the subsection size). In this
> way, the section size can be increased freely. I do not see a strong tie
> between add_memory() and the section size, especially since we have
> subsection bitmap support.

Okay, let me express my thoughts; I could have sworn I explained back then
why I am not a friend of messing with the existing pageblock size:

1. Pageblock size

There are a couple of features that rely on the pageblock size being
reasonably small to work as expected. One example is virtio-balloon free
page reporting, then there is virtio-mem (still also glued to MAX_ORDER),
and we have CMA (still also glued to MAX_ORDER). Most probably there are
more. We track movability / page isolation per pageblock; it's the smallest
granularity at which you can effectively isolate pages or mark them as CMA
(MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP / huge
pages most of our applications use and will use, especially on smallish
systems.

Assume you bump up the pageblock order to 1G. Small VMs won't be able to
report any free pages to the hypervisor. You'll take the "fine-grained" out
of virtio-mem. Each CMA area will have to be at least 1G big, which turns
CMA essentially useless on smallish systems (like we have on arm64 with 64k
base pages -- pageblock_size is 512MB and I hate it).

Then, imagine systems that have like 4G of main memory. By stopping
grouping at 2M and instead grouping at 1G, you can very easily find
yourself in a system where all your 4 pageblocks are unmovable and you
essentially don't optimize for huge pages in that environment any more.

Long story short: we need a different mechanism on top and should leave the
pageblock size untouched; it's too tightly integrated with page isolation,
ordinary THP, and CMA.

2. Section size

I assume the only reason you want to touch that is because pageblock_size
<= section_size, and I guess that's one of the reasons I dislike it so
much. Messing with the section size really only makes sense when we want
to manage metadata for a larger granularity within a section. We allocate
metadata per section. We mark whole sections early/online/present/....
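To keep the granularities in this discussion concrete, here is a rough
back-of-the-envelope sketch. The constants are assumptions based on common
x86-64 defaults with 4 KiB base pages, not values taken from the patch set
under discussion:

```python
# Assumed x86-64 defaults (4 KiB base pages); other architectures differ
# (e.g., arm64 with 64k base pages ends up with 512 MiB pageblocks).
PAGE_SHIFT = 12                       # 4 KiB base page
SECTION_SIZE_BITS = 27                # one memory section: 128 MiB
SUBSECTION_SHIFT = 21                 # one subsection: 2 MiB
pageblock_order = 9                   # pageblock == PMD-sized THP

MiB = 1 << 20
section_size = 1 << SECTION_SIZE_BITS
subsection_size = 1 << SUBSECTION_SHIFT
pageblock_size = (1 << PAGE_SHIFT) << pageblock_order

print(section_size // MiB)        # 128 -> smallest add_memory() unit today
print(subsection_size // MiB)     # 2   -> granularity of the sub-section map
print(pageblock_size // MiB)      # 2   -> granularity of isolation / CMA
print((1 << 30) // section_size)  # 8   -> sections under one 1 GiB page
```

With these defaults, a single 1 GiB gigantic page spans eight whole
sections, which is why grouping at 1 GiB does not fit naturally into
per-section (or per-pageblock) metadata.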
Yes, in the case of vmemmap, we manage the memmap at a smaller granularity
using the sub-section map -- some kind of hack to support some ZONE_DEVICE
cases better.

Let's assume we introduce something new, "gigapage_order", corresponding to
1G. We could either decide to squeeze the metadata into sections, having to
increase the section size, or manage that metadata differently.

Managing it differently certainly makes the necessary changes easier.
Instead of adding more hacks into sections, rather manage that metadata in
a different place / in a different way.

See [1] for an alternative. Not necessarily what I would dream of, but just
to showcase that there might be alternative ways to group pages.

3. Grouping pages > pageblock_order

There are other approaches that would benefit from grouping at >
pageblock_order and having a bigger MAX_ORDER. And that doesn't necessarily
mean forming gigantic pages only; we might want to group at multiple
granularities on a single system. Memory hot(un)plug is one example, but so
is optimizing memory consumption by powering down DIMM banks. Also, some
architectures support differing huge page sizes (aarch64) that could be
improved without CMA. Why not have more than 2 THP sizes on these systems?

Ideally, we'd have a mechanism that tries grouping at different
granularities, like for every order in pageblock_order ...
max_pageblock_order (e.g., 1 GiB), and not only add one new level of
grouping (or increase the single grouping size).

[1] https://lkml.kernel.org/r/20210414023803.937-1-lipeifeng@oppo.com

-- 
Thanks,

David / dhildenb