From: Peter Zijlstra <peterz@infradead.org>
To: Russell King <rmk@arm.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Avi Kivity <avi@redhat.com>, Thomas Gleixner <tglx@linutronix.de>,
	Rik van Riel <riel@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	David Miller <davem@davemloft.net>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Mel Gorman <mel@csn.ul.ie>, Nick Piggin <npiggin@kernel.dk>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Yanmin Zhang <yanmin_zhang@linux.intel.com>,
	"Luck,Tony" <tony.luck@intel.com>,
	PaulMundt <lethal@linux-sh.org>,
	Chris Metcalf <cmetcalf@tilera.com>
Subject: Re: [PATCH 06/17] arm: mmu_gather rework
Date: Mon, 28 Feb 2011 12:44:47 +0100	[thread overview]
Message-ID: <1298893487.2428.10537.camel@twins> (raw)
In-Reply-To: <20110225215123.GA10026@flint.arm.linux.org.uk>

On Fri, 2011-02-25 at 21:51 +0000, Russell King wrote:
> On Fri, Feb 25, 2011 at 07:04:43PM +0100, Peter Zijlstra wrote:
> > I'm not quite sure why you chose to add range tracking on
> > pte_free_tlb(), the only affected code path seems to be unmap_region()
> > where you'll use a flush_tlb_range(), but it's buggy, the pte_free_tlb()
> > range is much larger than 1 page, and if you do it there you also need
> > it for all the other p??_free_tlb() functions.
> 
> My reasoning is to do with the way the LPAE stuff works.  For the
> explanation below, I'm going to assume a 2-level page table system
> for simplicity.
> 
> The first thing to realise is that if we have L2 entries, then we'll
> have unmapped them first using the usual tlb shootdown interfaces.
> 
> However, when we're freeing the page tables themselves, we should
> already have removed the L2 entries, so all we have are the L1 entries.
> In most 'normal' processors, these aren't cached in any way.
> 
> However, with LPAE, these are cached.  I'm told that any TLB flush for an
> address which is covered by the L1 entry will cause that cached entry to
> be invalidated.
> 
> So really this is about getting rid of cached L1 entries, and not the
> usual TLB lookaside entries that you'd come to expect.


Right, so the normal case is:

  unmap_region()
    tlb_gather_mmu()
    unmap_vmas()
      for (; vma; vma = vma->vm_next)
        unmap_page_range()
          tlb_start_vma() -> flush cache range
          zap_*_range()
            ptep_get_and_clear_full() -> batch/track external tlbs
            tlb_remove_tlb_entry() -> batch/track external tlbs
            tlb_remove_page() -> track range/batch page
          tlb_end_vma() -> flush tlb range

 [ for architectures that have hardware page table walkers
   concurrent faults can still load the page tables ]

    free_pgtables()
      while (vma)
        unlink_*_vma()
        free_*_range()
          *_free_tlb()
    tlb_finish_mmu()

  free vmas


Now, if we want to track ranges _and_ have hardware page table walkers
(ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
because the hardware walker could have re-populated the cache after
clearing the PTEs but before freeing the page tables.

What ARM does is retain the last vma pointer and track the
pte_free_tlb() range, using both in tlb_finish_mmu(), which is a tad
hacky.

Mostly because of shift_arg_pages(), where we have:

  shift_arg_pages()
    tlb_gather_mmu()
    free_*_range()
    tlb_finish_mmu()

For which ARM now punts and does a full tlb invalidate (no vma pointer).
But also because it drags along that vma pointer, which might not at all
match the range it's actually going to invalidate (and hence its vm_flags
might not accurately reflect things -- at worst more expensive than
needed).

The reason I wanted flush_tlb_range() to take an mm_struct and not the
current vm_area_struct is because we can avoid doing the
flush_tlb_range() from tlb_end_vma() and delay the thing until
tlb_finish_mmu() without having to resort to such games as above. We
could simply track the full range over all VMAs and freed page tables
and do one range invalidate.

ARM uses vm_flags & VM_EXEC to see if it also needs to invalidate
I-TLBs, and TILE uses VM_EXEC and VM_HUGETLB.

For the I-TLBs we could easily use
ptep_get_and_clear_full()/tlb_remove_tlb_entry() and see if any of the
cleared pte's had its executable bit set (both ARM and TILE seem to have
such a PTE bit).

I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
needs explicit flushes for huge ptes it might just have to issue
multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().

So my current proposal for generic mmu_gather (not fully coded yet) is
to provide a number of CONFIG_goo switches:

  CONFIG_HAVE_RCU_TABLE_FREE - for architectures that do not walk the
Linux page tables in hardware (Sparc64, PowerPC, etc.), and others where
TLB flushing isn't delayed by disabling IRQs (Xen, s390).

  CONFIG_HAVE_MMU_GATHER_RANGE - will track start,end ranges from
tlb_remove_tlb_entry() and p*_free_tlb() and issue
flush_tlb_range(mm,start,end) instead of mm-wide invalidates.

  CONFIG_HAVE_MMU_GATHER_ITLB - will use
ptep_get_and_clear_full()/tlb_remove_tlb_entry() to test pte_exec() and
issue flush_itlb_range(mm,start,end).

Then there is the optimization s390 wants, which is to do a full mm tlb
flush for fullmm (exit_mmap()) at tlb_gather_mmu() and never again after
that, since there is guaranteed no concurrency to poke at anything.
AFAICT that should work on all architectures so we can do that
unconditionally.

So the biggest problem with implementing the above is TILE, where we
need to figure out wth to do with its hugetlb stuff.

The second biggest problem is with ARM and TILE, where we'd need to
implement flush_itlb_range(). I've already got a patch for all other
architectures to convert flush_tlb_range() to mm_struct.


