Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already

From: Nathan Chancellor <nathan@kernel.org>
To: Yin Fengwei <fengwei.yin@intel.com>
Cc: riel@surriel.com, akpm@linux-foundation.org, shy828301@gmail.com,
	willy@infradead.org, kirill.shutemov@intel.com,
	ying.huang@intel.com, feng.tang@intel.com,
	zhengjun.xing@linux.intel.com, linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
Date: Tue, 27 Dec 2022 13:15:14 -0700	[thread overview]
Message-ID: <Y6tSUutBqeXeOrcR@dev-arch.thelio-3990X> (raw)
In-Reply-To: <20221223135207.2275317-1-fengwei.yin@intel.com>

On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
> 
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
> 
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>        entry_SYSCALL_64_after_hwframe
>            11.52%
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 __x64_sys_madvise
>                 do_madvise.part.0
>                 zap_page_range
>                 unmap_single_vma
>                 unmap_page_range
>                 page_remove_rmap
>                 deferred_split_huge_page
>                 __lock_text_start
>                 native_queued_spin_lock_slowpath
> 
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
> 
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
> 
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:07.99 real,   26367.77 user,  5063.35 sys
> 
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         7:22.15 real,   26235.03 user,  12504.55 sys
> 
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:08.49 real,   26520.15 user,  5047.91 sys
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

I cannot say whether or not this is a good idea or not but it does
resolve the regression I reported:

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
    Time (mean ± σ):     383.003 s ±  0.680 s    [User: 34737.850 s, System: 7287.079 s]
    Range (min … max):   382.218 s … 383.413 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
    Time (mean ± σ):     437.886 s ±  1.030 s    [User: 35888.658 s, System: 14048.871 s]
    Range (min … max):   436.865 s … 438.924 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
    Time (mean ± σ):     384.371 s ±  1.004 s    [User: 35402.880 s, System: 6401.691 s]
    Range (min … max):   383.547 s … 385.489 s    3 runs

Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
> 
> For the race without lock acquired, I didn't see obvious issue here. But I
> could miss something here. Let me know if I did. Thanks.
> 
> 
>  mm/huge_memory.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
>  	if (PageSwapCache(page))
>  		return;
>  
> +	if (!list_empty(page_deferred_list(page)))
> +		return;
> +
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	if (list_empty(page_deferred_list(page))) {
>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> 
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
> -- 
> 2.34.1
> 
> 
>