Hi Roman,

On 4 Mar 2021, at 11:45, Roman Gushchin wrote:

> On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
>> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
>>
>>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
>>>> From: Zi Yan
>>>>
>>>> Hi all,
>>>>
>>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
>>>> and the code is available at
>>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
>>>> if you want to give it a try. The actual 49 patches are not sent out with this
>>>> cover letter. :)
>>>>
>>>> Instead of asking for code review, I would like to discuss the concerns I got
>>>> from previous RFCs. I think there are two major ones:
>>>>
>>>> 1. 1GB page allocation. The current implementation allocates 1GB pages from CMA
>>>>    regions that are reserved at boot time, like hugetlbfs. The concern with
>>>>    using CMA is that an educated guess is needed to avoid depleting kernel
>>>>    memory in case CMA regions are set too large. Recently David Rientjes
>>>>    proposed using process_madvise() for hugepage collapse, which is an
>>>>    alternative [1] but might not work for 1GB pages, since there is no way of
>>>>    _allocating_ a 1GB page into which to collapse pages. I proposed a similar
>>>>    approach at LSF/MM 2019, generating physically contiguous memory after pages
>>>>    are allocated [2], which is usable for 1GB THPs. This approach does in-place
>>>>    huge page promotion and thus does not require page allocation.
>>>
>>> Well, I don't think there is an alternative to cma as of now. When the memory is
>>> almost filled at least once, any subsequent activity leading to substantial slab
>>> allocations (e.g. running git gc) will fragment the memory, so that there are
>>> virtually no chances to find a contiguous GB.
>>>
>>> It's possible in theory to reduce the fragmentation at the 1GB scale by grouping
>>> non-movable pageblocks, but it seems like a separate project.
>>
>> My experiments showed that finding contiguous GBs is possible, but I agree that
>> CMA is more reliable and 1GB-scale defragmentation should be a separate project.
>
> I actually ran a large scale experiment (on tens of thousands of machines) in the last
> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.

Thanks for the information. I finally have time to come back to this. Do you mind sharing
the total memory of these machines? I want to have some idea of the scale of this issue
to make sure I reproduce it on a proper machine. Are you trying to get <20% of tens of GBs,
hundreds of GBs, or TBs of memory?

> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> Without cma, the chances reach 0% very fast after reboot, and even manual manipulations
> like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.

Is there a way of replicating such an environment with publicly available software?
I really want to understand the root cause and am willing to find a possible solution.
It would be much easier if I could reproduce this locally.

> Even with cma we had to fix a number of additional problems (like sub-optimal placement
> of cma areas, 2MB THP migration, some ext4 and btrfs page migration issues) to have
> a reasonable success rate of ~95-99%. And it's not 100% anyway.
>
> The problem with artificial tests is that you're likely experimenting on a freshly
> rebooted machine which isn't/wasn't doing much.
> It's a bad model of the real memory state of a production server.

Yes, I agree that my experiment is not representative. Can you provide more information
on what application behavior(s) lead to this memory fragmentation? I guess it is because
non-movable pages are spread across the entire physical memory space. Is there a quick
reproducer for that?

Thanks.

—
Best Regards,
Yan Zi
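
P.S. To make sure I replicate the same manual steps: by "dropping caches, calling sync,
compaction" I assume you mean roughly the sequence below. It is just a minimal sketch
against the standard /proc knobs (drop_caches, compact_memory, buddyinfo), nothing from
the patchset; note that /proc/buddyinfo only reports free lists up to order-10, so it can
only hint at fragmentation well below the 1GB (order-18) scale.

/*
 * Minimal sketch: force a cache drop plus compaction, then dump the
 * per-order free lists. Assumes the standard /proc knobs (drop_caches,
 * compact_memory, buddyinfo) and root privileges.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[512];
	FILE *f;

	sync();						/* flush dirty data first */
	write_knob("/proc/sys/vm/drop_caches", "3");	/* drop page cache + slab */
	write_knob("/proc/sys/vm/compact_memory", "1");	/* trigger full compaction */

	/* Per-zone free counts by order; order-9 (2MB) and order-10 (4MB) are the top. */
	f = fopen("/proc/buddyinfo", "r");
	if (!f) {
		perror("/proc/buddyinfo");
		return EXIT_FAILURE;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);

	return EXIT_SUCCESS;
}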