If a process (qemu) running on a lot of CPUs (128) tries to munmap() a
large chunk of memory (496GB) mapped with THP, the call takes an average
of 275 seconds, which can cause a lot of problems for the workload (in
the qemu case, the guest locks up for that whole time).

Trying to find the source of this bug, I found that most of this time is
spent in serialize_against_pte_lookup(). This function spends a long time
in smp_call_function_many() if more than a couple of CPUs are running the
user process. Since it has to be called for every THP mapped, it takes a
very long time for large amounts of memory.

According to the docs, serialize_against_pte_lookup() is needed to
prevent the pmd_t to pte_t casting inside find_current_mm_pte(), or any
other lockless pagetable walk, from happening concurrently with THP
splitting/collapsing.

It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[], which
can only run on a given CPU after interrupts are re-enabled there. Since
interrupts are (usually) disabled during a lockless pagetable walk, and
serialize_against_pte_lookup() only returns after interrupts are enabled,
the walk is protected.

So, as far as I understand, if no lockless pagetable walk is running,
there is no need to call serialize_against_pte_lookup().

To avoid the cost of running serialize_against_pte_lookup(), I propose a
counter that keeps track of how many find_current_mm_pte() calls are
currently running and, if there are none, just skips
smp_call_function_many().

The related functions are:

start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running

On my workload (qemu), munmap's time dropped from 275 seconds to 418ms.
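
To illustrate the idea, here is a minimal userspace sketch of the
counting scheme, not the code from the patches. The three walk helpers
use the names listed above; struct mm_sketch, its fields, and the
boolean return value of serialize_against_pte_lookup() are assumptions
made only for this example, and the increment of ipi_broadcasts stands
in for the real smp_call_function_many() call. It uses C11 atomics with
the default sequentially consistent ordering; the barriers needed in the
kernel would have to be worked out separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of the mm context. */
struct mm_sketch {
	atomic_int lockless_pgtbl_walk_count;	/* walks currently in flight */
	int ipi_broadcasts;			/* times the expensive path ran */
};

/* Call before starting any lockless pagetable walk. */
static void start_lockless_pgtbl_walk(struct mm_sketch *mm)
{
	atomic_fetch_add(&mm->lockless_pgtbl_walk_count, 1);
}

/* Call after the walk ends, i.e. after the ptep is last used. */
static void end_lockless_pgtbl_walk(struct mm_sketch *mm)
{
	int prev = atomic_fetch_sub(&mm->lockless_pgtbl_walk_count, 1);

	assert(prev > 0);	/* an end without a matching start is a bug */
}

/* Number of lockless pagetable walks currently running. */
static int running_lockless_pgtbl_walk(struct mm_sketch *mm)
{
	return atomic_load(&mm->lockless_pgtbl_walk_count);
}

/*
 * Skip the broadcast when no walk is in flight.
 * Returns true if the expensive path was taken (sketch-only return value).
 */
static bool serialize_against_pte_lookup(struct mm_sketch *mm)
{
	if (running_lockless_pgtbl_walk(mm) == 0)
		return false;

	/* Stands in for smp_call_function_many(..., do_nothing, ...). */
	mm->ipi_broadcasts++;
	return true;
}
```

With no walk in flight, serialize_against_pte_lookup() returns without
touching the other CPUs; between a start_lockless_pgtbl_walk() and its
matching end_lockless_pgtbl_walk(), it takes the slow path as before.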
> Leonardo Bras (11):
>   powerpc/mm: Adds counting method to monitor lockless pgtable walks
>   asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
>     walks
>   mm/gup: Applies counting method to monitor gup_pgd_range
>   powerpc/mce_power: Applies counting method to monitor lockless pgtbl
>     walks
>   powerpc/perf: Applies counting method to monitor lockless pgtbl walks
>   powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
>     pgtbl walks
>   powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
>     walks
>   powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
>     pgtbl walks
>   powerpc/kvm/book3s_64: Applies counting method to monitor lockless
>     pgtbl walks
>   powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
>     walk
>   powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
>
>  arch/powerpc/include/asm/book3s/64/mmu.h     |  3 +++
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++++
>  arch/powerpc/kernel/mce_power.c              | 13 ++++++++++---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 ++
>  arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 ++++++++++++++++++--
>  arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++++
>  arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++++++
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++++++-
>  arch/powerpc/kvm/e500_mmu_host.c             |  4 ++++
>  arch/powerpc/mm/book3s64/hash_tlb.c          |  2 ++
>  arch/powerpc/mm/book3s64/hash_utils.c        |  7 +++++++
>  arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
>  arch/powerpc/mm/book3s64/pgtable.c           | 20 +++++++++++++++++++-
>  arch/powerpc/perf/callchain.c                |  5 ++++-
>  include/asm-generic/pgtable.h                |  9 +++++++++
>  mm/gup.c                                     |  4 ++++
>  16 files changed, 108 insertions(+), 8 deletions(-)
>