Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already

From: "Yin, Fengwei" <fengwei.yin@intel.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: <riel@surriel.com>, <nathan@kernel.org>,
	<akpm@linux-foundation.org>, <shy828301@gmail.com>,
	<willy@infradead.org>, <kirill.shutemov@intel.com>,
	<feng.tang@intel.com>, <zhengjun.xing@linux.intel.com>,
	<linux-mm@kvack.org>
Subject: Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
Date: Thu, 29 Dec 2022 09:15:54 +0800	[thread overview]
Message-ID: <57261dc9-e498-766f-e244-32b5e48b28b4@intel.com> (raw)
In-Reply-To: <87ilhwxox0.fsf@yhuang6-desk2.ccr.corp.intel.com>


On 12/28/2022 10:06 AM, Huang, Ying wrote:
> Yin Fengwei <fengwei.yin@intel.com> writes:
> 
>> Kernel build regression with LLVM was reported here:
>> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
>> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
>> boundaries"). And the commit f35b5d7d676e was reverted.
>>
>> It turned out the regression is related with madvise(MADV_DONTNEED)
>> was used by ld.lld. But with none PMD_SIZE aligned parameter len.
>> trace-bpfcc captured:
>> 531607  531732  ld.lld          do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
>> 531607  531793  ld.lld          do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>>
>> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can
>> trigger split_queue_lock contention raised significantly. perf showed
>> following data:
>>     14.85%     0.00%  ld.lld           [kernel.kallsyms]           [k]
>>        entry_SYSCALL_64_after_hwframe
>>            11.52%
>>                 entry_SYSCALL_64_after_hwframe
>>                 do_syscall_64
>>                 __x64_sys_madvise
>>                 do_madvise.part.0
>>                 zap_page_range
>>                 unmap_single_vma
>>                 unmap_page_range
>>                 page_remove_rmap
>>                 deferred_split_huge_page
>>                 __lock_text_start
>>                 native_queued_spin_lock_slowpath
>>
>> If THP can't be removed from rmap as whole THP, partial THP will be
>> removed from rmap by removing sub-pages from rmap. Even the THP
>> head page is added to deferred queue already, the split_queue_lock
>> will be acquired and check whether the THP head page is in the queue
>> already. Thus, the contention of split_queue_lock is raised.
>>
>> Before acquire split_queue_lock, check and bail out early if the THP
>> head page is in the queue already. The checking without holding
>> split_queue_lock could race with deferred_split_scan, but it doesn't
>> impact the correctness here.
>>
>> Test result of building kernel with ld.lld:
>> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:07.99 real,   26367.77 user,  5063.35 sys
>>
>> commit f35b5d7d676e:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         7:22.15 real,   26235.03 user,  12504.55 sys
>>
>> commit f35b5d7d676e with the fixing patch:
>> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>>         6:08.49 real,   26520.15 user,  5047.91 sys
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> 
> Thank you for fixing.
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Thanks for reviewing the patch.

Regards
Yin, Fengwei