All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nathan Chancellor <nathan@kernel.org>
To: Yin Fengwei <fengwei.yin@intel.com>
Cc: riel@surriel.com, akpm@linux-foundation.org, shy828301@gmail.com,
	willy@infradead.org, kirill.shutemov@intel.com,
	ying.huang@intel.com, feng.tang@intel.com,
	zhengjun.xing@linux.intel.com, linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
Date: Tue, 27 Dec 2022 13:15:14 -0700	[thread overview]
Message-ID: <Y6tSUutBqeXeOrcR@dev-arch.thelio-3990X> (raw)
In-Reply-To: <20221223135207.2275317-1-fengwei.yin@intel.com>

On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
> 
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
> 
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>        entry_SYSCALL_64_after_hwframe
>            11.52%
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 __x64_sys_madvise
>                 do_madvise.part.0
>                 zap_page_range
>                 unmap_single_vma
>                 unmap_page_range
>                 page_remove_rmap
>                 deferred_split_huge_page
>                 __lock_text_start
>                 native_queued_spin_lock_slowpath
> 
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
> 
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
> 
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:07.99 real,   26367.77 user,  5063.35 sys
> 
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         7:22.15 real,   26235.03 user,  12504.55 sys
> 
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:08.49 real,   26520.15 user,  5047.91 sys
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>

I cannot say whether or not this is a good idea or not but it does
resolve the regression I reported:

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
    Time (mean ± σ):     383.003 s ±  0.680 s    [User: 34737.850 s, System: 7287.079 s]
    Range (min … max):   382.218 s … 383.413 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
    Time (mean ± σ):     437.886 s ±  1.030 s    [User: 35888.658 s, System: 14048.871 s]
    Range (min … max):   436.865 s … 438.924 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
    Time (mean ± σ):     384.371 s ±  1.004 s    [User: 35402.880 s, System: 6401.691 s]
    Range (min … max):   383.547 s … 385.489 s    3 runs

Tested-by: Nathan Chancellor <nathan@kernel.org>

> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
> 
> For the race without lock acquired, I didn't see obvious issue here. But I
> could miss something here. Let me know if I did. Thanks.
> 
> 
>  mm/huge_memory.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
>  	if (PageSwapCache(page))
>  		return;
>  
> +	if (!list_empty(page_deferred_list(page)))
> +		return;
> +
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	if (list_empty(page_deferred_list(page))) {
>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
> 
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
> -- 
> 2.34.1
> 
> 
> 


  reply	other threads:[~2022-12-27 20:15 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
2022-12-27 20:15 ` Nathan Chancellor [this message]
2022-12-29  1:14   ` Yin, Fengwei
2022-12-28  1:24 ` David Rientjes
2022-12-29  1:15   ` Yin, Fengwei
2022-12-28  2:06 ` Huang, Ying
2022-12-29  1:15   ` Yin, Fengwei
2022-12-28 23:33 ` Andrew Morton
2022-12-29  1:29   ` Yin, Fengwei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y6tSUutBqeXeOrcR@dev-arch.thelio-3990X \
    --to=nathan@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=feng.tang@intel.com \
    --cc=fengwei.yin@intel.com \
    --cc=kirill.shutemov@intel.com \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=shy828301@gmail.com \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=zhengjun.xing@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.