linux-mm.kvack.org archive mirror
* Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
@ 2019-09-27 14:46 Leonardo Bras
  2019-09-27 23:25 ` Leonardo Bras
  0 siblings, 1 reply; 4+ messages in thread
From: Leonardo Bras @ 2019-09-27 14:46 UTC (permalink / raw)
  To: jhubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: benh, paulus, mpe, arnd, aneesh.kumar, christophe.leroy, akpm,
	dan.j.williams, npiggin, mahesh, gregkh, tglx, ganeshgr, allison,
	rppt, yuehaibing, ira.weiny, jgg, keith.busch

John Hubbard <jhubbard@nvidia.com> writes:

> Hi Leonardo,
>
> Thanks for adding linux-mm to CC for this next round of reviews. For the benefit
> of any new reviewers, I'd like to add that there are some issues, discovered
> while reviewing the v2 patchset, that are not (yet) addressed in this v3 series.

> Since those issues are not listed in the cover letter above, I'll list them here

Thanks for bringing that up.
The cover letter is a great place to put this info; I will keep that in
mind for future patchsets.

>
> 1. The locking model requires a combination of disabling interrupts and
> atomic counting and memory barriers, but
>
> 	a) some memory barriers are missing
> 	(start/end_lockless_pgtbl_walk), and

It seems to work fine today because of the number of instructions
executed between irq_disable / start_lockless_pgtbl_walk and the point
where the THP collapse/split can happen. (It's very unlikely that the
reordering reaches that far.)

But I don't think it would hurt to put a memory barrier after
irq_disable, just in case.
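
Something along these lines, for example (just a rough sketch; the
counter name and its place in mm->context are illustrative, and callers
would keep doing the irq disable themselves right before the start
call):

	static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		atomic_inc(&mm->context.lockless_pgtbl_walkers);
		/*
		 * Make the counter increment visible before any pagetable
		 * load of the walk; pairs with the barrier in
		 * serialize_against_pte_lookup().
		 */
		smp_mb__after_atomic();
	}

	static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		/* Order the last pte access before the counter decrement */
		smp_mb__before_atomic();
		atomic_dec(&mm->context.lockless_pgtbl_walkers);
	}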

> 	b) some cases (patch #8) fail to disable interrupts

I have looked into that, and it seems that some uses of
{start,end}_lockless_pgtbl_walk are unneeded, because they operate on a
(nested) guest pgd, and I was told that is safe against THP
split/collapse.

In other uses, there is no interrupt disable because the function is
called in real mode, with MSR_EE=0, where interrupts are already
disabled, so there is no need to disable them again.
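
So, in those real-mode paths, the callers would keep only the counter
update, roughly like this (sketch, with illustrative variable names):

	/*
	 * Real-mode caller (sketch): MSR_EE is already 0 here, so there is
	 * no local_irq_save()/restore() pair, only the counter update.
	 */
	start_lockless_pgtbl_walk(mm);
	ptep = __find_linux_pte(pgdir, ea, NULL, &shift);
	/* ... read *ptep while the walk is marked as running ... */
	end_lockless_pgtbl_walk(mm);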

>
> ...so the synchronization appears to be inadequate. (And if it *is* adequate, then
> definitely we need the next item, to explain it.)


>
> 2. Documentation of the synchronization/locking model needs to exist, once we
> figure out the exact details of (1).

I will add the missing documentation in the code, so it is easier to
understand in the future.

>
> 3. Related to (1), I've asked to change things so that interrupt controls and 
> atomic inc/dec are in the same start/end calls--assuming, of course, that the
> caller can tolerate that. 

I am not sure whether it would be ok to use irq_{save,restore} in real
mode; I will do some more reading of the docs before addressing this.
>
> 4. Please see the v2 series for any other details I've missed.
>
> thanks,
> -- 
> John Hubbard
> NVIDIA
>

Thank you for helping, John!

Best regards,
Leonardo Bras



* Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
  2019-09-27 14:46 [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
@ 2019-09-27 23:25 ` Leonardo Bras
  0 siblings, 0 replies; 4+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:25 UTC (permalink / raw)
  To: jhubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: benh, paulus, mpe, arnd, aneesh.kumar, christophe.leroy, akpm,
	dan.j.williams, npiggin, mahesh, gregkh, tglx, ganeshgr, allison,
	rppt, yuehaibing, ira.weiny, jgg, keith.busch

On Fri, 2019-09-27 at 11:46 -0300, Leonardo Bras wrote:
> I am not sure whether it would be ok to use irq_{save,restore} in real
> mode; I will do some more reading of the docs before addressing this.

It looks like it's unsafe to merge irq_{save,restore} into
{start,end}_lockless_pgtbl_walk(), because they may end up accessing
code that is not accessible in real mode.
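
For illustration, a merged helper would look something like the sketch
below, and the local_irq_save()/restore() machinery (for example, irq
tracing hooks in some configurations) is not guaranteed to be
accessible while running in real mode:

	/* Sketch of the rejected "merged" variant; names are illustrative */
	static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		unsigned long irq_mask;

		/* Not safe for real-mode callers: may reach unmapped code */
		local_irq_save(irq_mask);
		atomic_inc(&mm->context.lockless_pgtbl_walkers);
		smp_mb__after_atomic();

		return irq_mask;
	}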

I am sending a v4 with the changes so far.
I look forward to your feedback.



* Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
  2019-09-24 21:24 Leonardo Bras
@ 2019-09-24 22:58 ` John Hubbard
  0 siblings, 0 replies; 4+ messages in thread
From: John Hubbard @ 2019-09-24 22:58 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
	Dan Williams, Nicholas Piggin, Mahesh Salgaonkar,
	Greg Kroah-Hartman, Thomas Gleixner, Ganesh Goudar,
	Allison Randal, Mike Rapoport, YueHaibing, Ira Weiny,
	Jason Gunthorpe, Keith Busch

On 9/24/19 2:24 PM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems for the workload (in the qemu
> case, the guest stays locked up for that whole time).
> 
> Trying to find the source of this bug, I found out that most of this
> time is spent in serialize_against_pte_lookup(). This function takes a
> lot of time in smp_call_function_many() if there is more than a couple
> of CPUs running the user process. Since it has to run once for every
> mapped THP, it takes a very long time for large amounts of memory.
> 
> According to the docs, serialize_against_pte_lookup() is needed to keep
> the pmd_t to pte_t casting inside find_current_mm_pte(), or any other
> lockless pagetable walk, from happening concurrently with THP
> splitting/collapsing.
> 
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[],
> which can only run on a given CPU after interrupts are re-enabled there.
> Since interrupts are (usually) disabled during a lockless pagetable
> walk, and serialize_against_pte_lookup() only returns after interrupts
> are enabled again, the walk is protected.
> 
> So, as far as I understand, if there is no lockless pagetable walk
> running, there is no need to call serialize_against_pte_lookup().
> 
> So, to avoid the cost of running serialize_against_pte_lookup(), I
> propose a counter that keeps track of how many find_current_mm_pte()
> calls are currently running, so that smp_call_function_many() can
> simply be skipped when there are none.
> 
> The related functions are:
> start_lockless_pgtbl_walk(mm)
> 	Insert before starting any lockless pgtable walk
> end_lockless_pgtbl_walk(mm)
> 	Insert after the end of any lockless pgtable walk
> 	(Mostly after the ptep is last used)
> running_lockless_pgtbl_walk(mm)
> 	Returns the number of lockless pgtable walks running
> 
> On my workload (qemu), I could see munmap's time drop from 275 seconds
> to 418 ms.
> 
> Changes since v2:
>  Rebased to v5.3
>  Adds support for __get_user_pages_fast
>  Adds usage description to *_lockless_pgtbl_walk()
>  Better style for dummy functions
>  Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Hi Leonardo,

Thanks for adding linux-mm to CC for this next round of reviews. For the benefit
of any new reviewers, I'd like to add that there are some issues, discovered
while reviewing the v2 patchset, that are not (yet) addressed in this v3 series.

Since those issues are not listed in the cover letter above, I'll list them here:

1. The locking model requires a combination of disabling interrupts and
atomic counting and memory barriers, but

	a) some memory barriers are missing (start/end_lockless_pgtbl_walk), and
	b) some cases (patch #8) fail to disable interrupts

...so the synchronization appears to be inadequate. (And if it *is* adequate, then
definitely we need the next item, to explain it.)

2. Documentation of the synchronization/locking model needs to exist, once we
figure out the exact details of (1).

3. Related to (1), I've asked to change things so that interrupt controls and 
atomic inc/dec are in the same start/end calls--assuming, of course, that the
caller can tolerate that. 

4. Please see the v2 series for any other details I've missed.

thanks,
-- 
John Hubbard
NVIDIA

>  
> Changes since v1:
>  Isolated atomic operations in functions *_lockless_pgtbl_walk()
>  Fixed behavior of decrementing before last ptep was used
>  Link: http://patchwork.ozlabs.org/patch/1163093/
>  
> Leonardo Bras (11):
>   powerpc/mm: Adds counting method to monitor lockless pgtable walks
>   asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
>     walks
>   mm/gup: Applies counting method to monitor gup_pgd_range
>   powerpc/mce_power: Applies counting method to monitor lockless pgtbl
>     walks
>   powerpc/perf: Applies counting method to monitor lockless pgtbl walks
>   powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
>     pgtbl walks
>   powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
>     walks
>   powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
>     pgtbl walks
>   powerpc/kvm/book3s_64: Applies counting method to monitor lockless
>     pgtbl walks
>   powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
>     walk
>   powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
> 
>  arch/powerpc/include/asm/book3s/64/mmu.h     |  3 ++
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++
>  arch/powerpc/kernel/mce_power.c              | 13 +++++--
>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +
>  arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 +++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++
>  arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++-
>  arch/powerpc/kvm/e500_mmu_host.c             |  4 ++
>  arch/powerpc/mm/book3s64/hash_tlb.c          |  2 +
>  arch/powerpc/mm/book3s64/hash_utils.c        |  7 ++++
>  arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
>  arch/powerpc/mm/book3s64/pgtable.c           | 40 +++++++++++++++++++-
>  arch/powerpc/perf/callchain.c                |  5 ++-
>  include/asm-generic/pgtable.h                | 15 ++++++++
>  mm/gup.c                                     |  8 ++++
>  16 files changed, 138 insertions(+), 8 deletions(-)
> 



* [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
@ 2019-09-24 21:24 Leonardo Bras
  2019-09-24 22:58 ` John Hubbard
  0 siblings, 1 reply; 4+ messages in thread
From: Leonardo Bras @ 2019-09-24 21:24 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Greg Kroah-Hartman, Thomas Gleixner,
	Ganesh Goudar, Allison Randal, Mike Rapoport, YueHaibing,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the workload (in the qemu
case, the guest stays locked up for that whole time).

Trying to find the source of this bug, I found out that most of this
time is spent in serialize_against_pte_lookup(). This function takes a
lot of time in smp_call_function_many() if there is more than a couple
of CPUs running the user process. Since it has to run once for every
mapped THP, it takes a very long time for large amounts of memory.

According to the docs, serialize_against_pte_lookup() is needed to keep
the pmd_t to pte_t casting inside find_current_mm_pte(), or any other
lockless pagetable walk, from happening concurrently with THP
splitting/collapsing.

It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[],
which can only run on a given CPU after interrupts are re-enabled there.
Since interrupts are (usually) disabled during a lockless pagetable
walk, and serialize_against_pte_lookup() only returns after interrupts
are enabled again, the walk is protected.

So, as far as I understand, if there is no lockless pagetable walk
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I
propose a counter that keeps track of how many find_current_mm_pte()
calls are currently running, so that smp_call_function_many() can
simply be skipped when there are none.
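
On the serialize side, the idea is roughly the following (just a
sketch; do_nothing() is the existing empty IPI callback, and
running_lockless_pgtbl_walk() is described right below):

	/* arch/powerpc/mm/book3s64/pgtable.c (sketch) */
	static void serialize_against_pte_lookup(struct mm_struct *mm)
	{
		/* Pair with the barriers in the start/end helpers */
		smp_mb();

		/*
		 * Only pay for the IPIs if at least one lockless pagetable
		 * walk is currently running on this mm.
		 */
		if (running_lockless_pgtbl_walk(mm))
			smp_call_function_many(mm_cpumask(mm), do_nothing,
					       NULL, 1);
	}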

The related functions are (see the usage sketch after the list):
start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running
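
A minimal usage sketch on the generic side, mimicking what
__get_user_pages_fast() already does in mm/gup.c (and assuming the
dummy fallbacks from the asm-generic patch for architectures that do
not need the counter):

	/*
	 * Sketch only: wrap the existing lockless walk with the start/end
	 * markers, keeping the interrupt disable at the call site.
	 */
	local_irq_save(flags);
	start_lockless_pgtbl_walk(current->mm);
	gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
	end_lockless_pgtbl_walk(current->mm);
	local_irq_restore(flags);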

On my workload (qemu), I could see munmap's time drop from 275 seconds
to 418 ms.

Changes since v2:
 Rebased to v5.3
 Adds support for __get_user_pages_fast
 Adds usage description to *_lockless_pgtbl_walk()
 Better style for dummy functions
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839
 
Changes since v1:
 Isolated atomic operations in functions *_lockless_pgtbl_walk()
 Fixed behavior of decrementing before last ptep was used
 Link: http://patchwork.ozlabs.org/patch/1163093/
 
Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
    walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
    walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h     |  3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++
 arch/powerpc/kernel/mce_power.c              | 13 +++++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 +++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++
 arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++-
 arch/powerpc/kvm/e500_mmu_host.c             |  4 ++
 arch/powerpc/mm/book3s64/hash_tlb.c          |  2 +
 arch/powerpc/mm/book3s64/hash_utils.c        |  7 ++++
 arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
 arch/powerpc/mm/book3s64/pgtable.c           | 40 +++++++++++++++++++-
 arch/powerpc/perf/callchain.c                |  5 ++-
 include/asm-generic/pgtable.h                | 15 ++++++++
 mm/gup.c                                     |  8 ++++
 16 files changed, 138 insertions(+), 8 deletions(-)

-- 
2.20.1




