From: David Hildenbrand
Organization: Red Hat
Subject: Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
Date: Mon, 9 Aug 2021 09:20:28 +0200
To: Zi Yan
Cc: Vlastimil Babka, linux-mm@kvack.org, Matthew Wilcox,
 "Kirill A. Shutemov", Mike Kravetz, Michal Hocko, John Hubbard,
 linux-kernel@vger.kernel.org, Mike Rapoport
In-Reply-To: <83221D29-5ABE-40F1-8FF3-3B901E494C33@nvidia.com>
References: <20210805190253.2795604-1-zi.yan@sent.com>
 <40982106-0eee-4e62-7ce0-c4787b0afac4@suse.cz>
 <72b317e5-c78a-f0bc-fe69-f82261ec252e@redhat.com>
 <3417eb98-36c8-5459-c83e-52f90e42a146@suse.cz>
 <59c59a77-cf93-40a8-2ad5-b72d87b8815a@redhat.com>
 <83221D29-5ABE-40F1-8FF3-3B901E494C33@nvidia.com>

On 06.08.21 20:24, Zi Yan wrote:
> On 6 Aug 2021, at 13:08, David Hildenbrand wrote:
>
>> On 06.08.21 18:54, Vlastimil Babka wrote:
>>> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>>>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>>>> From: Zi Yan
>>>>>
>>>>>> Patch 3 restores the pfn_valid_within() check when the buddy
>>>>>> allocator can merge pages across memory sections. The check was
>>>>>> removed when ARM64 got rid of holes in zones, but holes can
>>>>>> appear in zones again after this patchset.
>>>>>
>>>>> To me that's a most unwelcome resurrection. I kinda missed it was
>>>>> going away and now I can't even rejoice? I assume the systems
>>>>> that will be bumping max_order have a lot of memory. Are they
>>>>> going to have many holes? What if we just sacrificed the memory
>>>>> that would have a hole and didn't add it to buddy at all?
>>>>
>>>> I think the old implementation was just horrible, and the
>>>> description we have here still suffers from that old crap: "but
>>>> holes can appear in zones again". No, it's not related to holes in
>>>> zones at all. We can have MAX_ORDER - 1 pages that are partially a
>>>> hole.
>>>>
>>>> And to be precise, "hole" here means "there is no memmap", not
>>>> "there is a hole but it has a valid memmap".
>>>
>>> Yes.
>>>
>>>> But IIRC, under SPARSEMEM we now always have a complete memmap for
>>>> complete memory sections (when talking about system RAM;
>>>> ZONE_DEVICE is different, but we don't really care for now, I
>>>> think).
>>>>
>>>> So instead of reintroducing what we had before, I think we should
>>>> look into something that doesn't confuse each person who stumbles
>>>> over it out there. What does pfn_valid_within() even mean in the
>>>> new context? pfn_valid() is most probably no longer what we really
>>>> want, as we're dealing with multiple sections that might be online
>>>> or offline; in the old world this was different, as a MAX_ORDER - 1
>>>> page was completely contained in a memory section that was either
>>>> online or offline.
>>>>
>>>> I'd imagine something that expresses something different in the
>>>> context of sparsemem:
>>>>
>>>> "Some page orders, such as MAX_ORDER - 1, might span multiple
>>>> memory sections. Each memory section has a completely valid memmap
>>>> if online. Memory sections might either be completely online or
>>>> completely offline. pfn_to_online_page() might succeed on one part
>>>> of a MAX_ORDER - 1 page, but not on another part. But it will
>>>> certainly be consistent within one memory section."
>>>>
>>>> Further, as we know that MAX_ORDER - 1 and memory sections are
>>>> powers of two, we can actually do a binary search to identify
>>>> boundaries, instead of having to check each and every page in the
>>>> range.
>>>>
>>>> Is what I describe the actual reason why we introduced
>>>> pfn_valid_within()? (And might we better introduce something new,
>>>> with a better-fitting name?)
>>>
>>> What I don't like is mainly the re-addition of pfn_valid_within()
>>> (or whatever we'd call it) into __free_one_page() for performance
>>> reasons, and also to various pfn scanners (compaction) for
>>> performance and "I must not forget to check this, or do I?"
>>> confusion reasons. It would be really great if we could keep a
>>> guarantee that the memmap exists for MAX_ORDER blocks. I see two
>>> ways to achieve that:
>>>
>>> 1. we create a memmap for MAX_ORDER blocks; pages in sections not
>>> online are marked as reserved, or some other state that allows us
>>> to do checks such as "is there a buddy? no" without accessing a
>>> missing memmap
>>> 2. blocks smaller than MAX_ORDER are not released to the buddy
>>> allocator
>>>
>>> I think 1 would be more work, but less wasteful in the end?
>>
>> It will end up seriously messing with memory hot(un)plug. It's not
>> sufficient that there is a memmap (pfn_valid()); it has to be online
>> (pfn_to_online_page()) to actually have a meaning.
>>
>> So you'd have to allocate a memmap for all such memory sections,
>> initialize it to all PG_Reserved ("memory hole"), and mark these
>> memory sections online. Further, you need memory block devices that
>> are initialized and online.
>>
>> So far so good, although wasteful. What happens if someone hotplugs
>> a memory block that doesn't span a complete MAX_ORDER - 1 page?
>> Broken.
>>
>> The only "workaround" would be requiring that MAX_ORDER - 1 cannot
>> be bigger than memory blocks (memory_block_size_bytes()). The memory
>> block size determines our hot(un)plug granularity and can (on some
>> archs already) be determined at runtime. As both (MAX_ORDER and
>> memory_block_size_bytes()) would be determined at runtime, for
>> example, by an admin explicitly requesting it, this might be
>> feasible.
>>
>> Memory hot(un)plug / onlining / offlining would most probably work
>> naturally (although the hot(un)plug granularity is then limited to,
>> e.g., 1 GiB memory blocks). But if that's what an admin requests on
>> the command line, so be it.
>>
>> What might need some thought, though, is having such memory
>> blocks/sections overlap with devmem. Sub-section hotadd has to
>> continue working unless we want to seriously break some PMEM
>> devices.

> Thanks a lot for your valuable inputs!
>
> Yes, this might work. But it seems to also defeat the purpose of
> sparse memory, which allows memmapping only present PFNs, right?

Not really. It will only be suboptimal for corner cases.

Except for devmem corner cases, we already always populate the memmap
for complete memory sections. Now, we would populate the memmap for
all memory sections spanning a MAX_ORDER - 1 page, if bigger than a
section.

Will it matter in practice? I doubt it.

I consider 1 GiB allocations only relevant for really big machines.
There, we don't really expect to have a lot of random memory holes. On
a 1 TiB machine, with 1 GiB memory blocks and a 1 GiB max_order - 1
page, you don't expect a memory layout so completely fragmented that
allocating additional memmap for some memory sections really makes a
difference.

> Also it requires a lot more intrusive changes, which might not be
> accepted easily.

I guess it should require quite minimal changes, in contrast to what
you propose. What we would have to do is:

a) Check that the configured MAX_ORDER - 1 is effectively not bigger
than the memory block size

b) Initialize all sections spanning a MAX_ORDER - 1 page during boot;
we won't even have to mess with memory blocks et al.

All that's required is parsing/processing early parameters in the
right order.

That sounds not very intrusive compared to what you propose. Actually,
I think what you propose would be an optimization of that approach.
> I will look into the cost of the added pfn checks and try to
> optimize it. One thing I can think of is that these non-present PFNs
> should only appear at the beginning and at the end of a zone, since
> HOLES_IN_ZONE is gone; maybe I just need to store and check the PFN
> range of a zone instead of checking memory section validity, and
> modify the zone PFN range during memory hot(un)plug. For offline
> pages in the middle of a zone, struct page still exists and
> PageBuddy() returns false, since PG_offline is set, right?

I think we can have quite some crazy "sparse" layouts where you can
have random holes within a zone, not only at the beginning/end.

Offline pages can be identified using pfn_to_online_page(). You must
not touch their memmap, not even to check for PageBuddy(). PG_offline
is a special case where pfn_to_online_page() succeeds and the memmap
is valid; however, the pages are logically offline and might get
logically onlined later -- primarily used in virtualized environments,
for example, with memory ballooning.

You can treat PG_offline pages as if they were online; they are just
accounted differently (!managed) and shouldn't be touched, but
otherwise they are like any other allocated page.

-- 
Thanks,

David / dhildenb