From: David Hildenbrand <david@redhat.com>
To: Zi Yan <ziy@nvidia.com>, Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>,
Michael Ellerman <mpe@ellerman.id.au>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Thomas Gleixner <tglx@linutronix.de>,
x86@kernel.org, Andy Lutomirski <luto@kernel.org>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Rapoport <rppt@kernel.org>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Dan Williams <dan.j.williams@intel.com>,
Wei Yang <richard.weiyang@linux.alibaba.com>,
linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Wed, 12 May 2021 18:14:06 +0200 [thread overview]
Message-ID: <e132fdd9-65af-1cad-8a6e-71844ebfe6a2@redhat.com> (raw)
In-Reply-To: <746780E5-0288-494D-8B19-538049F1B891@nvidia.com>
>>
>> As stated somewhere here already, we'll have to look into making alloc_contig_range() (and its main users, CMA and virtio-mem) independent of MAX_ORDER and mainly rely on pageblock_order. The current handling in alloc_contig_range() is far from optimal, as we have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part of it contains something unmovable, even though we don't even want to allocate that part. I actually have that on my list (to be able to fully support pageblock_order instead of MAX_ORDER - 1 chunks in virtio-mem), however I didn't have time to look into it.
>
> So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
> should be used instead of the buddy allocator, while pageblock_order is kept at a small
> granularity like 2MB. Is that the case? Isn't it going to have a high failure rate
> when any of the pageblocks within a gigantic page range (like 1GB) becomes unmovable?
> Are you thinking of an additional mechanism/policy to prevent that from happening, as
> an additional step for gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?
>
I am not fully sure yet where the journey will go; I guess nobody
knows. Ultimately, having buddy support for >= the current MAX_ORDER (IOW,
increasing MAX_ORDER) will most probably happen, so it would be worth
investigating what has to be done to get that running as a first step.
Of course, we could temporarily think about wiring it up in the buddy like

	if (order < MAX_ORDER)
		__alloc_pages()...
	else
		alloc_contig_pages()

but it doesn't really improve the situation IMHO, it's just an API change.
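As a self-contained userspace toy, that dispatch might look like the following (all names besides MAX_ORDER are stand-ins invented for illustration, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER 11  /* typical x86-64 value: largest buddy chunk = 2^10 pages = 4 MiB */

/* Hypothetical stand-ins for __alloc_pages() and alloc_contig_pages(),
 * recording which path was taken. */
static bool used_buddy, used_contig;
static void alloc_via_buddy(unsigned int order)  { used_buddy = true; }
static void alloc_via_contig(unsigned int order) { used_contig = true; }

/* Sketch of the wiring described above: stay in the buddy below
 * MAX_ORDER, fall back to the contiguous allocator for larger requests. */
static void alloc_large(unsigned int order)
{
	if (order < MAX_ORDER)
		alloc_via_buddy(order);
	else
		alloc_via_contig(order);
}
```

An order-18 request (a 1 GiB gigantic page with 4 KiB base pages) would go through the contig path, an order-9 request (2 MiB) through the buddy; the point of the example is that callers see one entry point while the underlying mechanism stays unchanged.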
So I think we should look into increasing MAX_ORDER, seeing what needs
to be done to have that part running while keeping the section size and
the pageblock order as is. I know that at least memory
onlining/offlining, CMA, alloc_contig_range(), ... need tweaking,
especially if we don't increase the section size (but also if we do,
due to the way page isolation is currently handled). A MAX_ORDER - 1
page spanning different nodes might be another thing to look into (I
heard that it can already happen right now, but I don't remember the
details).
The next step after that would then be better fragmentation avoidance
for larger granularity like 1G THP.
>>
>> Further, page onlining / offlining code and early init code most probably also need care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly have inside MAX_ORDER - 1 pages might become a problem and will have to be handled. Not sure which other code has to be tweaked (compaction? page isolation?).
>
> Can you elaborate it a little more? From what I understand, memory holes mean valid
> PFNs are not contiguous before and after a hole, so pfn++ will not work, but
> struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
> would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
> needed instead of simple pfn++. Is there anything I am missing?
I think there are two cases when talking about MAX_ORDER and memory holes:
1. Hole with a valid memmap: the memmap is initialized to PageReserved()
   and the pages are not given to the buddy. pfn_valid() and
   pfn_to_page() work as expected.
2. Hole without a valid memmap: we have that CONFIG_HOLES_IN_ZONE thing
   already, see include/linux/mmzone.h. pfn_valid_within() checks are
   required. Doesn't win a beauty contest, but gets the job done in
   existing setups that seem to care.
"If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
need to check pfn validity within that MAX_ORDER_NR_PAGES block.
pfn_valid_within() should be used in this case; we optimise this away
when we have no holes within a MAX_ORDER_NR_PAGES block."
CONFIG_HOLES_IN_ZONE is just a bad name for this.
(Increasing the section size implies that we waste more memory for the
memmap in holes; increasing MAX_ORDER means that we might have to deal
with holes within MAX_ORDER chunks.)
We don't have too many pfn_valid_within() checks. I wonder if we could
add something that is optimized for "holes are a power of two and
properly aligned", because pfn_valid_within() right now deals with holes
of any kind, which makes it somewhat inefficient IIRC.
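To make the quoted rule from mmzone.h concrete, here is a small userspace sketch (not kernel code; toy_pfn_valid_within() and the hole layout are invented for illustration) of why every pfn inside a MAX_ORDER block has to be checked individually once holes can exist within MAX_ORDER_NR_PAGES:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER          11
#define MAX_ORDER_NR_PAGES (1UL << (MAX_ORDER - 1))  /* 1024 pages */

/* Toy "memmap": a 256-pfn hole sits inside the first MAX_ORDER block.
 * In the kernel this is what pfn_valid_within() reports under
 * CONFIG_HOLES_IN_ZONE; here it is just a stand-in predicate. */
static bool toy_pfn_valid_within(unsigned long pfn)
{
	return pfn < 512 || pfn >= 768;   /* hole: [512, 768) */
}

/* Count usable pages in a MAX_ORDER-aligned block, the way buddy-side
 * code must scan once holes can exist inside MAX_ORDER_NR_PAGES:
 * a plain pfn++ walk works, but each pfn must be validated before its
 * struct page is touched. */
static unsigned long count_valid(unsigned long start_pfn)
{
	unsigned long pfn, n = 0;

	for (pfn = start_pfn; pfn < start_pfn + MAX_ORDER_NR_PAGES; pfn++)
		if (toy_pfn_valid_within(pfn))  /* skip pfns without a memmap */
			n++;
	return n;
}
```

The "power of two and properly aligned" optimization mentioned above would let this per-pfn check be replaced by one range comparison per block, which is the potential win.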
>
> BTW, to test a system with memory holes, do you know if there is an easy way of adding
> random memory holes to an x86_64 VM, which could help reveal potential missing pieces
> in the code? Changing the BIOS-e820 table might be one way, but I have no idea
> how to do that on QEMU.
It might not be very easy that way. But I heard that some arm64 systems
have crazy memory layouts -- maybe there, it's easier to get something
nasty running? :)
https://lkml.kernel.org/r/YJpEwF2cGjS5mKma@kernel.org
I remember there was a way to define the e820 completely on kernel
cmdline, but I might be wrong ...
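The parameter meant here is most likely memmap= (documented in Documentation/admin-guide/kernel-parameters.txt): memmap=nn$ss marks [ss, ss+nn) as reserved, punching a hole into the map the kernel derives from BIOS-e820, and memmap=exactmap lets you define the whole map. A rough sketch of using it under QEMU; image name, sizes, and the rest of the cmdline are placeholders:

```shell
# Boot a VM whose kernel sees an artificial 64M hole at the 1G mark.
# Single quotes keep the shell from expanding '$1G'; in a GRUB config
# the '$' would additionally need escaping (memmap=64M\$1G).
qemu-system-x86_64 -nographic -m 4G \
    -kernel bzImage -append 'console=ttyS0 memmap=64M$1G'
# The kernel then logs a "user-defined physical RAM map" at boot, with
# the 64M range at 1G shown as reserved; check with: dmesg | grep -i e820
```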
--
Thanks,
David / dhildenb