* [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
@ 2022-12-23 13:52 Yin Fengwei
2022-12-27 20:15 ` Nathan Chancellor
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Yin Fengwei @ 2022-12-23 13:52 UTC (permalink / raw)
To: riel, nathan, akpm, shy828301, willy, kirill.shutemov,
ying.huang, feng.tang, zhengjun.xing, linux-mm
Cc: fengwei.yin
Kernel build regression with LLVM was reported here:
https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
boundaries"). And the commit f35b5d7d676e was reverted.
It turned out the regression is related with madvise(MADV_DONTNEED)
was used by ld.lld. But with none PMD_SIZE aligned parameter len.
trace-bpfcc captured:
531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
trigger split_queue_lock contention raised significantly. perf showed
following data:
14.85% 0.00% ld.lld [kernel.kallsyms] [k]
entry_SYSCALL_64_after_hwframe
11.52%
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_madvise
do_madvise.part.0
zap_page_range
unmap_single_vma
unmap_page_range
page_remove_rmap
deferred_split_huge_page
__lock_text_start
native_queued_spin_lock_slowpath
If THP can't be removed from rmap as whole THP, partial THP will be
removed from rmap by removing sub-pages from rmap. Even the THP
head page is added to deferred queue already, the split_queue_lock
will be acquired and check whether the THP head page is in the queue
already. Thus, the contention of split_queue_lock is raised.
Before acquire split_queue_lock, check and bail out early if the THP
head page is in the queue already. The checking without holding
split_queue_lock could race with deferred_split_scan, but it doesn't
impact the correctness here.
Test result of building kernel with ld.lld:
commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
6:07.99 real, 26367.77 user, 5063.35 sys
commit f35b5d7d676e:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
7:22.15 real, 26235.03 user, 12504.55 sys
commit f35b5d7d676e with the fixing patch:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
6:08.49 real, 26520.15 user, 5047.91 sys
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
My first thought was to change the per node deferred queue to per cpu.
It's complicated and may be overkill.
For the race without lock acquired, I didn't see obvious issue here. But I
could miss something here. Let me know if I did. Thanks.
mm/huge_memory.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..7cde9f702e63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
if (PageSwapCache(page))
return;
+ if (!list_empty(page_deferred_list(page)))
+ return;
+
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (list_empty(page_deferred_list(page))) {
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
--
2.34.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
@ 2022-12-27 20:15 ` Nathan Chancellor
2022-12-29 1:14 ` Yin, Fengwei
2022-12-28 1:24 ` David Rientjes
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Nathan Chancellor @ 2022-12-27 20:15 UTC (permalink / raw)
To: Yin Fengwei
Cc: riel, akpm, shy828301, willy, kirill.shutemov, ying.huang,
feng.tang, zhengjun.xing, linux-mm
On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
>
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
> 14.85% 0.00% ld.lld [kernel.kallsyms] [k]
> entry_SYSCALL_64_after_hwframe
> 11.52%
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_madvise
> do_madvise.part.0
> zap_page_range
> unmap_single_vma
> unmap_page_range
> page_remove_rmap
> deferred_split_huge_page
> __lock_text_start
> native_queued_spin_lock_slowpath
>
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
>
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
>
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 6:07.99 real, 26367.77 user, 5063.35 sys
>
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 7:22.15 real, 26235.03 user, 12504.55 sys
>
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 6:08.49 real, 26520.15 user, 5047.91 sys
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
I cannot say whether or not this is a good idea or not but it does
resolve the regression I reported:
Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
Time (mean ± σ): 383.003 s ± 0.680 s [User: 34737.850 s, System: 7287.079 s]
Range (min … max): 382.218 s … 383.413 s 3 runs
Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
Time (mean ± σ): 437.886 s ± 1.030 s [User: 35888.658 s, System: 14048.871 s]
Range (min … max): 436.865 s … 438.924 s 3 runs
Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
Time (mean ± σ): 384.371 s ± 1.004 s [User: 35402.880 s, System: 6401.691 s]
Range (min … max): 383.547 s … 385.489 s 3 runs
Tested-by: Nathan Chancellor <nathan@kernel.org>
> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
>
> For the race without lock acquired, I didn't see obvious issue here. But I
> could miss something here. Let me know if I did. Thanks.
>
>
> mm/huge_memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
> if (PageSwapCache(page))
> return;
>
> + if (!list_empty(page_deferred_list(page)))
> + return;
> +
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (list_empty(page_deferred_list(page))) {
> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
> --
> 2.34.1
>
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
2022-12-27 20:15 ` Nathan Chancellor
@ 2022-12-28 1:24 ` David Rientjes
2022-12-29 1:15 ` Yin, Fengwei
2022-12-28 2:06 ` Huang, Ying
2022-12-28 23:33 ` Andrew Morton
3 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2022-12-28 1:24 UTC (permalink / raw)
To: Yin Fengwei
Cc: riel, nathan, Andrew Morton, shy828301, Matthew Wilcox,
kirill.shutemov, ying.huang, feng.tang, zhengjun.xing, linux-mm
On Fri, 23 Dec 2022, Yin Fengwei wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
>
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
> 14.85% 0.00% ld.lld [kernel.kallsyms] [k]
> entry_SYSCALL_64_after_hwframe
> 11.52%
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_madvise
> do_madvise.part.0
> zap_page_range
> unmap_single_vma
> unmap_page_range
> page_remove_rmap
> deferred_split_huge_page
> __lock_text_start
> native_queued_spin_lock_slowpath
>
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
>
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
>
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 6:07.99 real, 26367.77 user, 5063.35 sys
>
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 7:22.15 real, 26235.03 user, 12504.55 sys
>
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 6:08.49 real, 26520.15 user, 5047.91 sys
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
2022-12-27 20:15 ` Nathan Chancellor
2022-12-28 1:24 ` David Rientjes
@ 2022-12-28 2:06 ` Huang, Ying
2022-12-29 1:15 ` Yin, Fengwei
2022-12-28 23:33 ` Andrew Morton
3 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2022-12-28 2:06 UTC (permalink / raw)
To: Yin Fengwei
Cc: riel, nathan, akpm, shy828301, willy, kirill.shutemov, feng.tang,
zhengjun.xing, linux-mm
Yin Fengwei <fengwei.yin@intel.com> writes:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
>
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
> trace-bpfcc captured:
> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>
> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
> trigger split_queue_lock contention raised significantly. perf showed
> following data:
> 14.85% 0.00% ld.lld [kernel.kallsyms] [k]
> entry_SYSCALL_64_after_hwframe
> 11.52%
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_madvise
> do_madvise.part.0
> zap_page_range
> unmap_single_vma
> unmap_page_range
> page_remove_rmap
> deferred_split_huge_page
> __lock_text_start
> native_queued_spin_lock_slowpath
>
> If THP can't be removed from rmap as whole THP, partial THP will be
> removed from rmap by removing sub-pages from rmap. Even the THP
> head page is added to deferred queue already, the split_queue_lock
> will be acquired and check whether the THP head page is in the queue
> already. Thus, the contention of split_queue_lock is raised.
>
> Before acquire split_queue_lock, check and bail out early if the THP
> head page is in the queue already. The checking without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
>
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 6:07.99 real, 26367.77 user, 5063.35 sys
>
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 7:22.15 real, 26235.03 user, 12504.55 sys
>
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
> 6:08.49 real, 26520.15 user, 5047.91 sys
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Thank you for fixing.
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
>
> For the race without lock acquired, I didn't see obvious issue here. But I
> could miss something here. Let me know if I did. Thanks.
>
>
> mm/huge_memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
> if (PageSwapCache(page))
> return;
>
> + if (!list_empty(page_deferred_list(page)))
> + return;
> +
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> if (list_empty(page_deferred_list(page))) {
> count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
` (2 preceding siblings ...)
2022-12-28 2:06 ` Huang, Ying
@ 2022-12-28 23:33 ` Andrew Morton
2022-12-29 1:29 ` Yin, Fengwei
3 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2022-12-28 23:33 UTC (permalink / raw)
To: Yin Fengwei
Cc: riel, nathan, shy828301, willy, kirill.shutemov, ying.huang,
feng.tang, zhengjun.xing, linux-mm
On Fri, 23 Dec 2022 21:52:07 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
>
> It turned out the regression is related with madvise(MADV_DONTNEED)
> was used by ld.lld.
Are we able to identify a Fixes: target for this? Thanks.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-27 20:15 ` Nathan Chancellor
@ 2022-12-29 1:14 ` Yin, Fengwei
0 siblings, 0 replies; 9+ messages in thread
From: Yin, Fengwei @ 2022-12-29 1:14 UTC (permalink / raw)
To: Nathan Chancellor
Cc: riel, akpm, shy828301, willy, kirill.shutemov, ying.huang,
feng.tang, zhengjun.xing, linux-mm
On 12/28/2022 4:15 AM, Nathan Chancellor wrote:
> On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>> 14.85% 0.00% ld.lld [kernel.kallsyms] [k]
>> entry_SYSCALL_64_after_hwframe
>> 11.52%
>> entry_SYSCALL_64_after_hwframe
>> do_syscall_64
>> __x64_sys_madvise
>> do_madvise.part.0
>> zap_page_range
>> unmap_single_vma
>> unmap_page_range
>> page_remove_rmap
>> deferred_split_huge_page
>> __lock_text_start
>> native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 6:07.99 real, 26367.77 user, 5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 7:22.15 real, 26235.03 user, 12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 6:08.49 real, 26520.15 user, 5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>
> I cannot say whether or not this is a good idea or not but it does
> resolve the regression I reported:
>
> Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
> Time (mean ± σ): 383.003 s ± 0.680 s [User: 34737.850 s, System: 7287.079 s]
> Range (min … max): 382.218 s … 383.413 s 3 runs
>
> Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
> Time (mean ± σ): 437.886 s ± 1.030 s [User: 35888.658 s, System: 14048.871 s]
> Range (min … max): 436.865 s … 438.924 s 3 runs
>
> Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
> Time (mean ± σ): 384.371 s ± 1.004 s [User: 35402.880 s, System: 6401.691 s]
> Range (min … max): 383.547 s … 385.489 s 3 runs
>
> Tested-by: Nathan Chancellor <nathan@kernel.org>
Thanks for testing the patch.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-28 1:24 ` David Rientjes
@ 2022-12-29 1:15 ` Yin, Fengwei
0 siblings, 0 replies; 9+ messages in thread
From: Yin, Fengwei @ 2022-12-29 1:15 UTC (permalink / raw)
To: David Rientjes
Cc: riel, nathan, Andrew Morton, shy828301, Matthew Wilcox,
kirill.shutemov, ying.huang, feng.tang, zhengjun.xing, linux-mm
On 12/28/2022 9:24 AM, David Rientjes wrote:
> On Fri, 23 Dec 2022, Yin Fengwei wrote:
>
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>> 14.85% 0.00% ld.lld [kernel.kallsyms] [k]
>> entry_SYSCALL_64_after_hwframe
>> 11.52%
>> entry_SYSCALL_64_after_hwframe
>> do_syscall_64
>> __x64_sys_madvise
>> do_madvise.part.0
>> zap_page_range
>> unmap_single_vma
>> unmap_page_range
>> page_remove_rmap
>> deferred_split_huge_page
>> __lock_text_start
>> native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 6:07.99 real, 26367.77 user, 5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 7:22.15 real, 26235.03 user, 12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 6:08.49 real, 26520.15 user, 5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>
> Acked-by: David Rientjes <rientjes@google.com>
Thanks for reviewing the patch.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-28 2:06 ` Huang, Ying
@ 2022-12-29 1:15 ` Yin, Fengwei
0 siblings, 0 replies; 9+ messages in thread
From: Yin, Fengwei @ 2022-12-29 1:15 UTC (permalink / raw)
To: Huang, Ying
Cc: riel, nathan, akpm, shy828301, willy, kirill.shutemov, feng.tang,
zhengjun.xing, linux-mm
On 12/28/2022 10:06 AM, Huang, Ying wrote:
> Yin Fengwei <fengwei.yin@intel.com> writes:
>
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>> 14.85% 0.00% ld.lld [kernel.kallsyms] [k]
>> entry_SYSCALL_64_after_hwframe
>> 11.52%
>> entry_SYSCALL_64_after_hwframe
>> do_syscall_64
>> __x64_sys_madvise
>> do_madvise.part.0
>> zap_page_range
>> unmap_single_vma
>> unmap_page_range
>> page_remove_rmap
>> deferred_split_huge_page
>> __lock_text_start
>> native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 6:07.99 real, 26367.77 user, 5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 7:22.15 real, 26235.03 user, 12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>> 6:08.49 real, 26520.15 user, 5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>
> Thank you for fixing.
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Thanks for reviewing the patch.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
2022-12-28 23:33 ` Andrew Morton
@ 2022-12-29 1:29 ` Yin, Fengwei
0 siblings, 0 replies; 9+ messages in thread
From: Yin, Fengwei @ 2022-12-29 1:29 UTC (permalink / raw)
To: Andrew Morton
Cc: riel, nathan, shy828301, willy, kirill.shutemov, ying.huang,
feng.tang, zhengjun.xing, linux-mm
On 12/29/2022 7:33 AM, Andrew Morton wrote:
> On Fri, 23 Dec 2022 21:52:07 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld.
>
> Are we able to identify a Fixes: target for this? Thanks.
I am not sure. The commit f35b5d7d676e didn't introduce this issue. It
just exposed this issue and it's reverted already. The partial THP
behavior was like this from the first day and not a regression.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2022-12-29 1:29 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-23 13:52 [RFC PATCH] mm/thp: check and bail out if page in deferred queue already Yin Fengwei
2022-12-27 20:15 ` Nathan Chancellor
2022-12-29 1:14 ` Yin, Fengwei
2022-12-28 1:24 ` David Rientjes
2022-12-29 1:15 ` Yin, Fengwei
2022-12-28 2:06 ` Huang, Ying
2022-12-29 1:15 ` Yin, Fengwei
2022-12-28 23:33 ` Andrew Morton
2022-12-29 1:29 ` Yin, Fengwei
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.