From: "Zi Yan" <ziy@nvidia.com>
To: "Roman Gushchin" <guro@fb.com>, "Matthew Wilcox" <willy@infradead.org>
Cc: linux-mm@kvack.org,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Yang Shi" <shy828301@gmail.com>,
	"Michal Hocko" <mhocko@suse.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"David Nellans" <dnellans@nvidia.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"David Rientjes" <rientjes@google.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"David Hildenbrand" <david@redhat.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Song Liu" <songliubraving@fb.com>
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Date: Wed, 31 Mar 2021 10:48:36 -0400
Message-ID: <85EDDC81-72F8-4204-A111-73AEAEADA2AF@nvidia.com>
In-Reply-To: <YGPtSXm6y4vb51rW@carbon.dhcp.thefacebook.com>

On 30 Mar 2021, at 23:32, Roman Gushchin wrote:

> On Wed, Mar 31, 2021 at 04:09:35AM +0100, Matthew Wilcox wrote:
>> On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
>>> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
>>>> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
>>>>> I actually ran a large-scale experiment (on tens of thousands of machines) over the last
>>>>> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
>>>>
>>>> Thanks for the information. I finally have time to come back to this. Do you mind sharing
>>>> the total memory of these machines? I want to get some idea of the scale of this issue so
>>>> that I can reproduce it on a proper machine. Are you trying to get <20% of 10s of GB,
>>>> 100s of GB, or TBs of memory?
>>>
>>> There are different configurations, but in general they are in the 100s of GB or smaller.
>>
>> Are you using ZONE_MOVABLE?  Seeing /proc/buddyinfo from one of these
>> machines might be illuminating.
>
> No, I'm using pre-allocated CMA areas, and it works fine.
> Buddyinfo stops at order 10, right?
> How is it helpful with fragmentation on a 1GB scale?
>
>>
>>>>
>>>>>
>>>>> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
>>>>> Without CMA, the chances of success drop to 0% very quickly after a reboot, and even manual
>>>>> interventions like shutting down all workloads, dropping caches, calling sync, running
>>>>> compaction, etc. do not help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
>>>>
>>>> Is there a way of replicating such an environment with publicly available software?
>>>> I really want to understand the root cause and am willing to find a possible solution.
>>>> It would be much easier if I could reproduce this locally.
>>>
>>> There is nothing fb-specific: once memory is filled with anon/pagecache, any subsequent
>>> allocations of non-movable memory (slabs, percpu, etc.) will fragment it. There is a
>>> pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
>>> fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).
>>
>> I think this is an area where the buddy allocator could be improved.
>> Of course, it knows nothing of larger page orders (which needs to be
>> fixed), but in general, I would like it to do a better job of segregating
>> movable and unmovable allocations.
>>
>> Let's take a machine with 100GB of memory as an example.  Ideally,
>> unmovable allocations would start at 4GB (assuming below 4GB is
>> ZONE_DMA32).  Movable allocations can allocate anywhere in memory, but
>> should avoid being "near" unmovable allocations.  Perhaps they start
>> at 5GB.  When unmovable allocations get up to 5GB, we should first exert
>> a bit of pressure to shrink the unmovable allocations (looking at you,
>> dcache), but eventually we'll need to grow the unmovable allocations
>> above 5GB and we should move, say, all the pages between 5GB and 5GB+1MB.
>> If this unmovable allocation was just temporary, we get a reassembled
>> 1MB page.  If it was permanent, we now have 1MB of memory to soak up
>> the next few allocations.
>>
>> The model I'm thinking of here is that we have a "line" in memory that
>> divides movable and unmovable allocations.  It can move up, but there
>> has to be significant memory pressure to do so.

Hi Roman and Matthew,

David Hildenbrand proposed an idea similar to Matthew's, ZONE_PREFER_MOVABLE,
which prefers movable allocations and serves as the fallback for unmovable
allocations when ZONE_NORMAL is full [1]. The size of ZONE_PREFER_MOVABLE can
also be changed dynamically on demand.
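
As I read [1], the zone fallback order would roughly be the following. This is
only a sketch of my understanding; zone_try() is a made-up helper, not the real
zonelist iteration:

/*
 * Sketch of the ZONE_PREFER_MOVABLE fallback order as I understand
 * it from [1]. zone_try() is hypothetical; __GFP_MOVABLE, ZONE_NORMAL
 * and the proposed ZONE_PREFER_MOVABLE are the only real names here.
 */
static struct page *alloc_page_sketch(gfp_t gfp, unsigned int order)
{
	struct page *page;

	if (gfp & __GFP_MOVABLE) {
		/* Movable allocations prefer the new zone... */
		page = zone_try(ZONE_PREFER_MOVABLE, order);
		if (!page)	/* ...but can spill into ZONE_NORMAL. */
			page = zone_try(ZONE_NORMAL, order);
		return page;
	}

	/* Unmovable allocations stay in ZONE_NORMAL... */
	page = zone_try(ZONE_NORMAL, order);
	if (!page)	/* ...and fall back only when it is full. */
		page = zone_try(ZONE_PREFER_MOVABLE, order);
	return page;
}

It is that last fallback, unmovable allocations spilling into the movable
area, that my concerns below are about.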

My concerns with ideas like this are:

1. Some long-lived unmovable pages might pin the boundary between the movable and
unmovable regions, so the part used for unmovable allocations might only ever grow.
Could something like walking the whole file system to create a lot of dentries,
while keeping the last visited file open, make this happen?

2. The cost of pushing the boundary. Unless both movable and unmovable allocations
grow towards the boundary, the kernel will need to migrate movable pages out of the
way to move it. With 4KB base pages, advancing the boundary by just 1MB already
means synchronously migrating up to 256 pages. That would create noticeable latency
for unmovable allocations that need to move the boundary, right?
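
To make Matthew's "line" model and this cost concrete, moving the boundary up by
one chunk might look like the sketch below. alloc_contig_range() and
free_contig_range() are the existing CMA helpers; unmovable_line_pfn and the
policy around it are made up for illustration:

/*
 * Hypothetical sketch only: evacuate the next 1MB above the
 * movable/unmovable boundary and advance the boundary. The
 * migration inside alloc_contig_range() (page copies, rmap walks,
 * TLB flushes) is the synchronous cost paid by the unmovable
 * allocation that hit the boundary.
 */
static unsigned long unmovable_line_pfn;	/* made-up boundary */

static int advance_unmovable_line(void)
{
	unsigned long start = unmovable_line_pfn;
	unsigned long nr = SZ_1M >> PAGE_SHIFT;	/* 256 pages with 4KB base pages */
	int ret;

	/* Migrate every movable page out of [start, start + nr). */
	ret = alloc_contig_range(start, start + nr,
				 MIGRATE_MOVABLE, GFP_KERNEL);
	if (ret)
		return ret;

	/* Hand the evacuated chunk over to the unmovable side. */
	free_contig_range(start, nr);
	unmovable_line_pfn = start + nr;
	return 0;
}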

>
> I agree. My idea (which I need to find some time to try) was to hack the pageblock
> code so that if we convert a block to non-movable, we convert an entire GB around it.
> In this case, all unmovable memory will likely fit into a few 1GB chunks,
> leaving all other chunks movable. But from a security point of view it would
> be less desirable, I guess. What do you think about it?

I like this idea and have thought about it too. It could reuse the existing
fragmentation-avoidance mechanism and does not raise the concerns above. But once
we add support for a 1GB pageblock size, there will still be some work needed to
prevent more than one 1GB pageblock from being converted; otherwise, memory can
still be fragmented by unmovable pages when multiple 1GB pageblocks are converted
to unmovable ones, assuming movable allocations can fall back to unmovable
pageblocks.
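
For reference, the conversion step might look roughly like this on top of
today's pageblock code. set_pageblock_migratetype(), pageblock_nr_pages, and
MIGRATE_UNMOVABLE exist; the 1GB alignment and the function itself are made up:

/*
 * Sketch of Roman's idea: when an unmovable allocation falls back
 * into a movable area, steal the whole surrounding 1GB rather than
 * a single 2MB pageblock. A real implementation would also move
 * the free pages onto the unmovable free lists and bound how many
 * 1GB chunks can be converted, per the concern above.
 */
static void steal_whole_gigabyte(struct page *page)
{
	unsigned long nr_1g = SZ_1G >> PAGE_SHIFT;
	unsigned long pfn = ALIGN_DOWN(page_to_pfn(page), nr_1g);
	unsigned long end = pfn + nr_1g;

	/* Mark every 2MB pageblock in the enclosing 1GB unmovable. */
	for (; pfn < end; pfn += pageblock_nr_pages)
		set_pageblock_migratetype(pfn_to_page(pfn),
					  MIGRATE_UNMOVABLE);
}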


[1] https://lore.kernel.org/linux-mm/6135d2c5-2a74-6ca8-4b3b-8ceb25c0d4b1@redhat.com/

—
Best Regards,
Yan Zi


