From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
	mgorman@techsingularity.net, wangkefeng.wang@huawei.com,
	jhubbard@nvidia.com, 21cnbao@gmail.com, ryan.roberts@arm.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm: support multi-size THP numa balancing
Date: Wed, 27 Mar 2024 16:09:23 +0800
Message-ID: <bc671388-f398-4776-af15-c144f2c39d78@linux.alibaba.com>
In-Reply-To: <87cyrgo2ez.fsf@yhuang6-desk2.ccr.corp.intel.com>



On 2024/3/27 10:04, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> Anonymous page allocation already supports multi-size THP (mTHP), but
>> numa balancing still prohibits mTHP migration even for exclusive mappings,
>> which is unreasonable.
>>
>> Allow scanning mTHP:
>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
>> pages") skips NUMA migration of shared CoW pages to avoid migrating shared
>> data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to
>> NUMA-migrate COW pages that have other uses") changed to use page_count()
>> to avoid migrating GUP pages, which also skips mTHP numa scanning.
>> In theory, we can use folio_maybe_dma_pinned() to detect the GUP issue;
>> although there is still a GUP race, that issue appears to have been
>> addressed by commit 80d47f5de5e3. Meanwhile, use folio_likely_mapped_shared()
>> to skip shared CoW pages, though this is not a precise sharer count. To
>> check whether a folio is shared, ideally we would make sure every page is
>> mapped by the same process, but doing so seems expensive, and the estimated
>> mapcount appears to work well when running the autonuma benchmark.
> 
> Because we can now deal with shared mTHP, would it even be possible to
> remove the folio_likely_mapped_shared() check?

IMO, the issue solved by commit 859d4adc3415 is about shared CoW 
mappings, and I would prefer to evaluate removing that check in a 
separate patch :)
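
To illustrate, the filter described in the commit message boils down to
something like the sketch below. The helper is hypothetical and only for
illustration; the real checks sit in the prot_numa scanning path of
change_pte_range() and in the numa hinting fault handler.

#include <linux/mm.h>

/*
 * Hypothetical helper, only to illustrate the filter described above:
 * skip folios with extra references (e.g. GUP-pinned) and shared CoW
 * folios on private mappings.
 */
static bool numa_scan_folio_allowed(struct vm_area_struct *vma,
				    struct folio *folio)
{
	/* Skip folios that may be pinned for DMA/GUP. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/*
	 * Skip shared CoW folios on private mappings.  Note that
	 * folio_likely_mapped_shared() is an estimate, not a precise
	 * sharer count.
	 */
	if (!(vma->vm_flags & VM_SHARED) && folio_likely_mapped_shared(folio))
		return false;

	return true;
}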

>> Allow migrating mTHP:
>> As mentioned in the previous thread[1], large folios (including THP) are
>> more susceptible to false-sharing issues among threads than 4K base pages,
>> leading to pages ping-ponging back and forth during numa balancing, which is
>> currently not easy to resolve. Therefore, as a start for supporting mTHP numa
>> balancing, we can follow the PMD-mapped THP strategy: reuse the 2-stage
>> filter in should_numa_migrate_memory() to check whether the mTHP is being
>> heavily contended among threads (by checking the CPU id and pid of the last
>> access) to avoid false sharing to some degree. Likewise, we restore all PTE
>> mappings upon the first hint page fault of a large folio, again following
>> the PMD-mapped THP strategy. In the future, we can continue to optimize the
>> NUMA balancing algorithm to avoid the false-sharing issue with large folios
>> as much as possible.
>>
>> Performance data:
>> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
>> Base: 2024-03-25 mm-unstable branch
>> Enable mTHP and run the autonuma-benchmark
>>
>> mTHP:16K
>> Base				Patched
>> numa01				numa01
>> 224.70				137.23
>> numa01_THREAD_ALLOC		numa01_THREAD_ALLOC
>> 118.05				50.57
>> numa02				numa02
>> 13.45				9.30
>> numa02_SMT			numa02_SMT
>> 14.80				7.43
>>
>> mTHP:64K
>> Base				Patched
>> numa01				numa01
>> 216.15				135.20
>> numa01_THREAD_ALLOC		numa01_THREAD_ALLOC
>> 115.35				46.93
>> numa02				numa02
>> 13.24				9.24
>> numa02_SMT			numa02_SMT
>> 14.67				7.31
>>
>> mTHP:128K
>> Base				Patched
>> numa01				numa01
>> 205.13				140.41
>> numa01_THREAD_ALLOC		numa01_THREAD_ALLOC
>> 112.93				44.78
>> numa02				numa02
>> 13.16				9.19
>> numa02_SMT			numa02_SMT
>> 14.81				7.39
>>
>> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/memory.c   | 56 +++++++++++++++++++++++++++++++++++++++++++--------
>>   mm/mprotect.c |  3 ++-
>>   2 files changed, 50 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c30fb4b95e15..36191a9c799c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5068,16 +5068,55 @@ static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_str
>>   	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>   }
>>   
>> +static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
>> +				       struct folio *folio, pte_t fault_pte, bool ignore_writable)
>> +{
>> +	int nr = pte_pfn(fault_pte) - folio_pfn(folio);
>> +	unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
>> +	unsigned long end = min(start + folio_nr_pages(folio) * PAGE_SIZE, vma->vm_end);
> 
> If start is in the middle of the folio, it's possible for end to go beyond
> the end of the folio.  So, it should be something like below?

Yes, good catch, even though the iteration below can skip over the parts 
that exceed the size of the folio.

> 	unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end);

Yes, this looks good to me. Will do in next version. Thanks.
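
So for the next version, the range computation would look roughly like
this (just a sketch of the agreed clamping, not the final code):

	/* Index of the faulting page within the folio. */
	int nr = pte_pfn(fault_pte) - folio_pfn(folio);
	/* Clamp the start to the VMA in case the folio extends below vma->vm_start. */
	unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
	/* Clamp the end to both the folio's last page and vma->vm_end. */
	unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE,
				vma->vm_end);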

>> +	pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
>> +	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
>> +	unsigned long addr;
>> +
>> +	/* Restore all PTEs' mapping of the large folio */
>> +	for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) {
>> +		pte_t pte, old_pte;
>> +		pte_t ptent = ptep_get(start_ptep);
>> +		bool writable = false;
>> +
>> +		if (!pte_present(ptent) || !pte_protnone(ptent))
>> +			continue;
>> +
>> +		if (vm_normal_folio(vma, addr, ptent) != folio)
>> +			continue;
>> +
>> +		if (!ignore_writable) {
>> +			ptent = pte_modify(ptent, vma->vm_page_prot);
>> +			writable = pte_write(ptent);
>> +			if (!writable && pte_write_upgrade &&
>> +			    can_change_pte_writable(vma, addr, ptent))
>> +				writable = true;
>> +		}
>> +
>> +		old_pte = ptep_modify_prot_start(vma, addr, start_ptep);
>> +		pte = pte_modify(old_pte, vma->vm_page_prot);
>> +		pte = pte_mkyoung(pte);
>> +		if (writable)
>> +			pte = pte_mkwrite(pte, vma);
>> +		ptep_modify_prot_commit(vma, addr, start_ptep, old_pte, pte);
>> +		update_mmu_cache_range(vmf, vma, addr, start_ptep, 1);
> 
> Can this be batched for the whole folio?

I thought about it, but things are a little tricky. The folio may not 
contain contiguous protnone PTEs, so we would still have to skip 
non-present or non-protnone PTEs.

Moreover, it would be necessary to define architecture-specific 
ptep_modify_prot_start*_nr and ptep_modify_prot_commit*_nr variants that 
can handle multiple PTEs. That is on my TODO list, together with batching 
the numa scanning in change_pte_range().
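
To make the idea concrete, something like the helper below could compute
the run length that a batched helper would have to handle. This is purely
hypothetical: neither this helper nor the *_nr variants exist yet.

/*
 * Hypothetical sketch: count how many consecutive PTEs starting at 'ptep'
 * are present, protnone and map the same folio.  A future batched
 * ptep_modify_prot_start/commit variant would operate on such a run.
 */
static int numa_protnone_run(struct vm_area_struct *vma, unsigned long addr,
			     pte_t *ptep, struct folio *folio, int max_nr)
{
	int nr = 0;

	while (nr < max_nr) {
		pte_t ptent = ptep_get(ptep + nr);

		if (!pte_present(ptent) || !pte_protnone(ptent))
			break;
		if (vm_normal_folio(vma, addr + nr * PAGE_SIZE, ptent) != folio)
			break;
		nr++;
	}

	return nr;
}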
