linux-mm.kvack.org archive mirror
From: Mike Kravetz <mike.kravetz@oracle.com>
To: David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	James Houghton <jthoughton@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Ray Fucillo <Ray.Fucillo@intersystems.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
Date: Tue, 19 Apr 2022 15:50:00 -0700
Message-ID: <ec97313f-c7ce-ab23-3934-faddbb782336@oracle.com>
In-Reply-To: <dcda550d-a92a-c95e-bd08-c578924d7f8d@redhat.com>

On 4/8/22 02:26, David Hildenbrand wrote:
>>>
>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>
>>> Sure, with thousands of processes sharing that memory, the size of page
>>> tables required would increase with each and every process. But TBH,
>>> that's in no way different to other file systems where we're even
>>> dealing with PTE tables.
>>
>> The numbers for a real use case I am frequently quoted are something like:
>> 1TB shared mapping, 10,000 processes sharing the mapping
>> 4K PMD Page per 1GB of shared mapping
>> 4M saving for each shared process
>> 9,999 * 4M ~= 39GB savings
> 
> 3.7% of all memory. Noticeable if the feature is removed? Yes. Do we
> care about supporting such corner cases that result in a maintenance
> burden? My take is a clear no.
> 
>>
>> However, if you look at commit 39dde65c9940c which introduced huge pmd sharing
>> it states that performance rather than memory savings was the primary
>> objective.
>>
>> "For hugetlb, the saving on page table memory is not the primary
>>  objective (as hugetlb itself already cuts down page table overhead
>>  significantly), instead, the purpose of using shared page table on hugetlb is
>>  to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>     
>>  With PT sharing, pte entries are shared among hundreds of processes, the
>>  cache consumption used by all the page table is smaller and in return,
>>  application gets much higher cache hit ratio.  One other effect is that
>>  cache hit ratio with hardware page walker hitting on pte in cache will be
>>  higher and this helps to reduce tlb miss latency.  These two effects
>>  contribute to higher application performance."
>>
>> That 'makes sense', but I have never tried to measure any such performance
>> benefit.  It is easier to calculate the memory savings.
> 
> It does make sense; but then again, what's specific here about hugetlb?
> 
> Most probably it was just easy to add to hugetlb in contrast to other
> types of shared memory.
> 
>>
>>>
>>> Which results in me wondering if
>>>
>>> a) We should simply use gigantic pages for such extreme use cases. Allows
>>>    for freeing up more memory via vmemmap either way.
>>
>> The only problem with this is that many processors in use today have
>> limited TLB entries for gigantic pages.
>>
>>> b) We should instead look into reclaiming reconstruct-able page table.
>>>    It's hard to imagine that each and every process accesses each and
>>>    every part of the gigantic file all of the time.
>>> c) We should instead establish a more generic page table sharing
>>>    mechanism.
>>
>> Yes.  I think that is the direction taken by the mshare() proposal.  If we have
>> a more generic approach we can certainly start deprecating hugetlb pmd
>> sharing.
> 
> My strong opinion is to remove it ASAP and get something proper into place.
> 

No arguments about the complexity of this code.  However, there will be some
people who will notice if it is removed.
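
For reference, here is the arithmetic behind the figures quoted above as a
small stand-alone program (purely illustrative, not kernel code; it assumes
8-byte page table entries, 4 KiB PMD pages holding 512 entries each, and
2 MiB hugepages as on x86-64):

#include <stdio.h>

int main(void)
{
        const unsigned long long entry_size = 8;              /* bytes per PMD entry */
        const unsigned long long hugepage   = 2ULL << 20;     /* 2 MiB */
        const unsigned long long pmd_covers = 512 * hugepage; /* 1 GiB mapped per 4K PMD page */

        /* PMD entries needed to map a 4 TiB device with 2 MiB pages */
        unsigned long long map = 4ULL << 40;
        printf("4 TiB mapping: %llu PMD entries = %llu MiB of entries\n",
               map / hugepage, (map / hugepage) * entry_size >> 20);

        /* The 1 TB / 10,000 process case: PMD pages needed per process */
        map = 1ULL << 40;
        unsigned long long per_proc = (map / pmd_covers) * 4096; /* bytes of PMD pages */
        printf("1 TiB mapping: %llu MiB of PMD pages per process, "
               "~%llu GiB saved across 9,999 sharers\n",
               per_proc >> 20, (per_proc * 9999) >> 30);
        return 0;
}

That prints 2097152 entries / 16 MiB for the first case, and 4 MiB per
process / ~39 GiB across 9,999 sharers for the second, matching the
numbers above.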

Whether or not we remove huge pmd sharing support, I would still like to
address the scalability issue.  To do so, taking i_mmap_rwsem in read mode
for fault processing needs to go away.  With that gone, the race between
faults and truncation needs to be handled differently, since it currently
relies on the fault code taking that lock.  At a high level this is fairly
simple, but hugetlb reservations add complexity, and this series did not
address them completely.
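
To make the intended ordering a bit more concrete, here is a minimal
user-space sketch of the pattern (again not kernel code; the single mutex
stands in for the per-index hugetlb fault mutex, and fake_i_size for the
file size): the fault path re-checks the size while holding the mutex, and
truncate shrinks the size under the same mutex, so a fault either completes
before the truncate or sees the new size and backs out.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t fault_mutex = PTHREAD_MUTEX_INITIALIZER;
static long fake_i_size = 1024;         /* "file size" in huge pages */

/* Fault path: check the index against the size and "install" the page
 * while holding the mutex, so a truncate cannot slip in between the
 * check and the install. */
static bool fault(long idx)
{
        bool installed = false;

        pthread_mutex_lock(&fault_mutex);
        if (idx < fake_i_size) {
                /* ... allocate and map the huge page here ... */
                installed = true;
        }
        pthread_mutex_unlock(&fault_mutex);
        return installed;
}

/* Truncate path: shrink the size under the same mutex, then remove any
 * pages at or beyond the new size. */
static void truncate_to(long new_size)
{
        pthread_mutex_lock(&fault_mutex);
        fake_i_size = new_size;
        /* ... unmap and free pages with index >= new_size ... */
        pthread_mutex_unlock(&fault_mutex);
}

int main(void)
{
        truncate_to(512);
        printf("fault at index 800 after truncate to 512: %s\n",
               fault(800) ? "installed (would leak)" : "rejected (correct)");
        return 0;
}

The real code of course has to deal with per-index fault mutexes,
reservation adjustments, and unmapping in the truncate path; the sketch
only shows the check-under-lock ordering.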

I will be sending out another RFC that more completely addresses the issues
this series attempted to solve.  I am not discounting your opinion that we
should get rid of huge pmd sharing.  Rather, I would at least like to get
some eyes on my approach to handling reservations in the presence of fault
and truncate races.
-- 
Mike Kravetz


Thread overview: 11+ messages
2022-04-06 20:48 [RFC PATCH 0/5] hugetlb: Change huge pmd sharing Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 1/5] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 2/5] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 3/5] hugetlbfs: move routine remove_huge_page to hugetlb.c Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 4/5] hugetlbfs: catch and handle truncate racing with page faults Mike Kravetz
2022-04-06 20:48 ` [RFC PATCH 5/5] hugetlb: Check for pmd unshare and fault/lookup races Mike Kravetz
2022-04-07 10:08 ` [RFC PATCH 0/5] hugetlb: Change huge pmd sharing David Hildenbrand
2022-04-07 16:17   ` Mike Kravetz
2022-04-08  9:26     ` David Hildenbrand
2022-04-19 22:50       ` Mike Kravetz [this message]
2022-04-20  7:12         ` David Hildenbrand
