Hello Peter, thanks for the feedback! On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote: > On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote: > > If a process (qemu) with a lot of CPUs (128) try to munmap() a large > > chunk of memory (496GB) mapped with THP, it takes an average of 275 > > seconds, which can cause a lot of problems to the load (in qemu case, > > the guest will lock for this time). > > > > Trying to find the source of this bug, I found out most of this time is > > spent on serialize_against_pte_lookup(). This function will take a lot > > of time in smp_call_function_many() if there is more than a couple CPUs > > running the user process. Since it has to happen to all THP mapped, it > > will take a very long time for large amounts of memory. > > > > By the docs, serialize_against_pte_lookup() is needed in order to avoid > > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless > > pagetable walk, to happen concurrently with THP splitting/collapsing. > > > > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], > > after interrupts are re-enabled. > > Since, interrupts are (usually) disabled during lockless pagetable > > walk, and serialize_against_pte_lookup will only return after > > interrupts are enabled, it is protected. > > This is something entirely specific to Power, you shouldn't be touching > generic code at all. Up to v4, I was declaring dummy functions so it would not mess up with other archs: http://patchwork.ozlabs.org/patch/1168779/ But I was recommended to create a generic function that could guide the way to archs: http://patchwork.ozlabs.org/patch/1168775/ The idea was to concentrate all routines of beginning/ending lockless pagetable walks on these functions, and call them instead of irq_disable/irq_enable. Then it was easy to place the refcount-based tracking in these functions. It should only be enabled in case the config chooses to do so. > > Also, I'm not sure I understand things properly. > > So serialize_against_pte_lookup() wants to wait for all currently > out-standing __find_linux_pte() instances (which are very similar to > gup_fast). > > It seems to want to do this before flushing the THP TLB for some reason; > why? Should not THP observe the normal page table freeing rules which > includes a RCU-like grace period like this already. > > Why is THP special here? This doesn't seem adequately explained. "It's necessary to monitor lockless pagetable walks, in order to avoid doing THP splitting/collapsing during them." If a there is a THP split/collapse during the lockless pagetable walk, the returned ptep can be a pointing to an invalid pte. To avoid that, the pmd is updated, then serialize_against_pte_lookup is ran. Serialize runs a do_nothing in all cpu in cpu_mask. So, after all cpus finish running do_nothing(), there is a guarantee that if there is any 'lockless pagetable walk' it is running on top of a updated version of this pmd, and so, collapsing/splitting THP is safe. > > Also, specifically to munmap(), this seems entirely superfluous, > munmap() uses the normal page-table freeing code and should be entirely > fine without additional waiting. To be honest, I remember it being needed in munmap case, but I really don't remember the details. I will take a deeper look and come back with this answer. > Furthermore, Power never accurately tracks mm_cpumask(), so using that > makes the whole thing more expensive than it needs to be. Also, I > suppose that is buggered vs file backed THP. That accuracy of mm_cpumask is above my knowledge right now. =) I agree that it's to expensive to do that. That's why I suggested this method, that can check if there is any 'lockless pagetable walk' running before trying to serialize. It reduced the waiting time a lot for large amounts of memory. (more details on cover letter) Best regards, Leonardo BrĂ¡s