linux-mm.kvack.org archive mirror
From: Mike Kravetz <mike.kravetz@oracle.com>
To: David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
Date: Tue, 19 Apr 2022 15:50:00 -0700
Message-ID: <ec97313f-c7ce-ab23-3934-faddbb782336@oracle.com>
In-Reply-To: <dcda550d-a92a-c95e-bd08-c578924d7f8d@redhat.com>

On 4/8/22 02:26, David Hildenbrand wrote:
>>>
>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>
>>> Sure, with thousands of processes sharing that memory, the size of page
>>> tables required would increase with each and every process. But TBH,
>>> that's in no way different to other file systems where we're even
>>> dealing with PTE tables.
>>
>> The numbers for a real use case I am frequently quoted are something like:
>> 1TB shared mapping, 10,000 processes sharing the mapping
>> 4K PMD Page per 1GB of shared mapping
>> 4M saving for each shared process
>> 9,999 * 4M ~= 39GB savings
> 
> 3.7% of all memory. Noticeable if the feature is removed? Yes. Do we
> care about supporting such corner cases that result in a maintenance
> burden? My take is a clear no.
> 
>>
>> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
>> it states that performance rather than memory savings was the primary
>> objective.
>>
>> "For hugetlb, the saving on page table memory is not the primary
>>  objective (as hugetlb itself already cuts down page table overhead
>>  significantly), instead, the purpose of using shared page table on hugetlb is
>>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>     
>>  With PT sharing, pte entries are shared among hundreds of processes, the
>>  cache consumption used by all the page table is smaller and in return,
>>  application gets much higher cache hit ratio.  One other effect is that
>>  cache hit ratio with hardware page walker hitting on pte in cache will be
>>  higher and this helps to reduce tlb miss latency.  These two effects
>>  contribute to higher application performance."
>>
>> That 'makes sense', but I have never tried to measure any such performance
>> benefit.  It is easier to calculate the memory savings.
> 
> It does make sense; but then again, what's specific here about hugetlb?
> 
> Most probably it was just easy to add to hugetlb in contrast to other
> types of shared memory.
> 
>>
>>>
>>> Which results in me wondering if
>>>
>>> a) We should simply use gigantic pages for such extreme use cases. Allows
>>>    for freeing up more memory via vmemmap either way.
>>
>> The only problem with this is that many processors in use today have
>> limited TLB entries for gigantic pages.
>>
>>> b) We should instead look into reclaiming reconstruct-able page table.
>>>    It's hard to imagine that each and every process accesses each and
>>>    every part of the gigantic file all of the time.
>>> c) We should instead establish a more generic page table sharing
>>>    mechanism.
>>
>> Yes.  I think that is the direction taken by the mshare() proposal.  If we have
>> a more generic approach we can certainly start deprecating hugetlb pmd
>> sharing.
> 
> My strong opinion is to remove it ASAP and get something proper into place.
> 

No arguments about the complexity of this code.  However, there will be some
people who will notice if it is removed.
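
For reference, here is the arithmetic behind the figures quoted above as a
small stand-alone program (purely illustrative, not kernel code; it assumes
8-byte page table entries, 4 KiB PMD pages holding 512 entries each, and
2 MiB hugepages as on x86-64):

#include <stdio.h>

int main(void)
{
        const unsigned long long entry_size = 8;              /* bytes per PMD entry */
        const unsigned long long hugepage   = 2ULL << 20;     /* 2 MiB */
        const unsigned long long pmd_covers = 512 * hugepage; /* 1 GiB mapped per 4K PMD page */

        /* PMD entries needed to map a 4 TiB device with 2 MiB pages */
        unsigned long long map = 4ULL << 40;
        printf("4 TiB mapping: %llu PMD entries = %llu MiB of entries\n",
               map / hugepage, (map / hugepage) * entry_size >> 20);

        /* The 1 TB / 10,000 process case: PMD pages needed per process */
        map = 1ULL << 40;
        unsigned long long per_proc = (map / pmd_covers) * 4096; /* bytes of PMD pages */
        printf("1 TiB mapping: %llu MiB of PMD pages per process, "
               "~%llu GiB saved across 9,999 sharers\n",
               per_proc >> 20, (per_proc * 9999) >> 30);
        return 0;
}

That prints 2097152 entries / 16 MiB for the first case, and 4 MiB per
process / ~39 GiB across 9,999 sharers for the second, matching the
numbers above.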

Whether or not we remove huge pmd sharing support, I would still like to
address the scalability issue.  To do so, taking i_mmap_rwsem in read mode
for fault processing needs to go away.  With that gone, the race between
faults and truncation needs to be handled differently, since it currently
relies on the fault code taking that lock.  At a high level this is fairly
simple, but hugetlb reservations add complexity, and this series did not
address them completely.
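
To make the intended ordering a bit more concrete, here is a minimal
user-space sketch of the pattern (again not kernel code; the single mutex
stands in for the per-index hugetlb fault mutex, and fake_i_size for the
file size): the fault path re-checks the size while holding the mutex, and
truncate shrinks the size under the same mutex, so a fault either completes
before the truncate or sees the new size and backs out.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t fault_mutex = PTHREAD_MUTEX_INITIALIZER;
static long fake_i_size = 1024;         /* "file size" in huge pages */

/* Fault path: check the index against the size and "install" the page
 * while holding the mutex, so a truncate cannot slip in between the
 * check and the install. */
static bool fault(long idx)
{
        bool installed = false;

        pthread_mutex_lock(&fault_mutex);
        if (idx < fake_i_size) {
                /* ... allocate and map the huge page here ... */
                installed = true;
        }
        pthread_mutex_unlock(&fault_mutex);
        return installed;
}

/* Truncate path: shrink the size under the same mutex, then remove any
 * pages at or beyond the new size. */
static void truncate_to(long new_size)
{
        pthread_mutex_lock(&fault_mutex);
        fake_i_size = new_size;
        /* ... unmap and free pages with index >= new_size ... */
        pthread_mutex_unlock(&fault_mutex);
}

int main(void)
{
        truncate_to(512);
        printf("fault at index 800 after truncate to 512: %s\n",
               fault(800) ? "installed (would leak)" : "rejected (correct)");
        return 0;
}

The real code of course has to deal with per-index fault mutexes,
reservation adjustments, and unmapping in the truncate path; the sketch
only shows the check-under-lock ordering.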

I will be sending out another RFC that more completely addresses the issues
this series attempted to solve.  I am not discounting your opinion that we
should get rid of huge pmd sharing.  Rather, I would at least like to get
some eyes on my approach to handling reservations in the presence of fault
and truncate races.
-- 
Mike Kravetz


Thread overview: 11+ messages
2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 2/5] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 3/5] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 4/5] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 5/5] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
2022-04-07 10:08 ` [RFC PATCH 0/5] hugetlb: Change huge pmd sharing David Hildenbrand
2022-04-07 16:17   ` Mike Kravetz
2022-04-08  9:26     ` David Hildenbrand
2022-04-19 22:50       ` Mike Kravetz [this message]
2022-04-20  7:12         ` David Hildenbrand
