On 30 Mar 2021, at 14:02, Roman Gushchin wrote:

> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
>> Hi Roman,
>>
>>
>> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
>>
>>> On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
>>>> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
>>>>
>>>>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
>>>>>> From: Zi Yan
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
>>>>>> and the code is available at
>>>>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
>>>>>> if you want to give it a try. The actual 49 patches are not sent out with this
>>>>>> cover letter. :)
>>>>>>
>>>>>> Instead of asking for code review, I would like to discuss the concerns I got
>>>>>> from previous RFCs. I think there are two major ones:
>>>>>>
>>>>>> 1. 1GB page allocation. Current implementation allocates 1GB pages from CMA
>>>>>> regions that are reserved at boot time like hugetlbfs. The concern with
>>>>>> using CMA is that an educated guess is needed to avoid depleting kernel
>>>>>> memory in case CMA regions are set too large. Recently David Rientjes
>>>>>> proposed to use process_madvise() for hugepage collapse, which is an
>>>>>> alternative [1] but might not work for 1GB pages, since there is no way of
>>>>>> _allocating_ a 1GB page into which pages can be collapsed. I proposed a similar
>>>>>> approach at LSF/MM 2019, generating physically contiguous memory after pages
>>>>>> are allocated [2], which is usable for 1GB THPs.
>>>>>> This approach does in-place huge page promotion and thus does not require
>>>>>> page allocation.
>>>>>
>>>>> Well, I don't think there is an alternative to cma as of now. When the memory is almost
>>>>> filled at least once, any subsequent activity leading to substantial slab allocations
>>>>> (e.g. run git gc) will fragment the memory, so that there are virtually no chances
>>>>> to find a contiguous GB.
>>>>>
>>>>> It's possible in theory to reduce the fragmentation on 1GB scale by grouping
>>>>> non-movable pageblocks, but it seems a separate project.
>>>>
>>>> My experiments showed that finding contiguous GBs is possible, but I agree that
>>>> CMA is more reliable and 1GB scale defragmentation should be a separate project.
>>>
>>> I actually ran a large scale experiment (on tens of thousands of machines) in the last
>>> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
>>
>> Thanks for the information. I finally have time to come back to this. Do you mind sharing
>> the total memory of these machines? I want to have some idea of the scale of this issue to
>> make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
>> 100s of GBs, or TBs of memory?
>
> There are different configurations, but in general they are in the 100s of GBs or smaller.
>
>>
>>>
>>> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
>>> Without cma chances are reaching 0% very fast after reboot, and even manual manipulations
>>> like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
>>> help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
>>
>> Is there a way of replicating such an environment with publicly available software?
>> I really want to understand the root cause and am willing to find a possible solution.
>> It would be much easier if I can reproduce this locally.
>
> There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
> allocations of non-movable memory (slabs, percpu, etc) will fragment the memory. There
> is a pageblock mechanism which prevents the fragmentation on 2MB scale, but nothing prevents
> the fragmentation on 1GB scale. It's just a matter of runtime (and the number of mm operations).
>
>>
>>>
>>> Even with cma we had to fix a number of additional problems (like sub-optimal placement
>>> of cma areas, 2MB THP migration, some ext4 and btrfs page migration issues) to have
>>> a reasonable success rate of about 95-99%. And it's not 100% anyway.
>>>
>>> The problem with artificial tests is that you're likely experimenting on a freshly
>>> rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
>>> state of a production server.
>>
>> Yes, I agree that my experiment is not representative. Can you provide more information
>> on what application behavior(s) lead to this memory fragmentation? I guess it is
>> because non-movable pages spread across the entire physical memory space. Is there
>> a quick reproducer for that?
>
> I have a simple C program which is able to fragment the memory, you can play with it:
> https://github.com/rgushchin/fragm .
>
> But as I said, basically any load which is actively using the whole memory
> will fragment it.

With your simple program, I am able to fragment the memory to the point that it is
impossible to allocate/generate 1GB pages. I will look into this. Thanks.

--
Best Regards,
Yan Zi
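
As a rough illustration of the point above, that pageblocks protect 2MB ranges but nothing protects 1GB ranges, here is a toy Python model. It is not kernel code and not the fragm program. It assumes 2MB pageblocks (so 512 per 1GB-aligned region) and that unmovable pageblocks land uniformly at random, which is a simplification of real allocator behavior. The name `clean_gb_regions` and its parameters are made up for this sketch.

```python
import random

# Assumption: 2MB pageblocks, so one 1GB-aligned region spans 512 of them.
PAGEBLOCKS_PER_GB = 512


def clean_gb_regions(total_gb, unmovable_blocks, seed=0):
    """Count 1GB-aligned regions that contain no unmovable pageblock.

    Toy model only: unmovable pageblocks are placed uniformly at random,
    which is an assumption, not how the kernel actually places them.
    """
    rng = random.Random(seed)
    total_blocks = total_gb * PAGEBLOCKS_PER_GB
    # Pick distinct pageblock indices that end up holding unmovable memory.
    dirty_blocks = rng.sample(range(total_blocks), unmovable_blocks)
    # A single unmovable pageblock is enough to spoil its whole 1GB region.
    dirty_regions = {block // PAGEBLOCKS_PER_GB for block in dirty_blocks}
    return total_gb - len(dirty_regions)


if __name__ == "__main__":
    # Hypothetical 128GB machine with a growing number of unmovable pageblocks.
    for unmovable in (0, 64, 256, 1024):
        clean = clean_gb_regions(128, unmovable)
        print(f"{unmovable:5d} unmovable pageblocks -> {clean:3d} clean 1GB regions")
```

In this model, 1024 unmovable pageblocks are only 2GB out of 128GB, yet they are expected to leave almost no 1GB region untouched. That is consistent with the observation that allocation chances drop quickly without CMA, although the real placement policy is more structured than uniform random.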