All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Rik van Riel <riel@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-mm@kvack.org,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	David Miller <davem@davemloft.net>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Mel Gorman <mel@csn.ul.ie>, Nick Piggin <npiggin@kernel.dk>,
	Russell King <rmk@arm.linux.org.uk>,
	Chris Metcalf <cmetcalf@tilera.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>
Subject: Re: [RFC][PATCH 2/6] mm: Change flush_tlb_range() to take an mm_struct
Date: Wed, 02 Mar 2011 22:40:27 +0100	[thread overview]
Message-ID: <1299102027.1310.39.camel@laptop> (raw)
In-Reply-To: <AANLkTimhWKhHojZ-9XZGSh3OzfPhvo__Dib9VfeMWoBQ@mail.gmail.com>

On Wed, 2011-03-02 at 11:19 -0800, Linus Torvalds wrote:
> On Wed, Mar 2, 2011 at 9:59 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > In order to be able to properly support architecture that want/need to
> > support TLB range invalidation, we need to change the
> > flush_tlb_range() argument from a vm_area_struct to an mm_struct
> > because the range might very well extend past one VMA, or not have a
> > VMA at all.
> 
> I really don't think this is right. The whole "drop the icache
> information" thing is a total anti-optimization, since for some
> architectures, the icache flush is the _big_ deal. 

Right, so Tile has the I-cache flush from flush_tlb_range(), I'm not
sure if that's the right thing to do, Documentation/cachetlb.txt seems
to suggest doing it from update_mmu_cache() like things.

However, I really don't know, and would happily be explained how these
things are supposed to work. Also:

> Possibly much
> bigger than the TLB flush itself. Doing an icache flush was much more
> expensive than the TLB flush on alpha, for example (the tlb had ASI's
> etc, the icache did not).

Right, but the problem remains that we do page-table teardown without
having a vma.

Now we can re-introduce I/D variants again by assuming D-only and using
tlb_start_vma() to set a I-too bit on VM_EXEC. (this assumes the vm_args
range is non-executable -- which it had better be).

How about I do something like:

enum {
  TLB_FLUSH_I = 1,
  TLB_FLUSH_D = 2,
  TLB_FLUSH_PAGE = 4,
  TLB_FLUSH_HPAGE = 8,
};

void flush_tlb_range(struct mm_struct *mm, unsigned long start,
		     unsigned long end, unsigned int flags);

And we then do:

tlb_gather_mmu(struct mmu_gather *tlb, ...)
{
  ...
  tlb->flush_type = TLB_FLUSH_D | TLB_FLUSH_PAGE;
}

tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
  if (!tlb->fullmm)
    flush_cache_range(vma, vma->vm_start, vma->vm_end);

  if (vma->vm_flags & VM_EXEC)
    tlb->flush_type |= TLB_FLUSH_I;

  if (vma->vm_flags & VM_HUGEPAGE)
    tlb->flush_type |= TLB_FLUSH_HPAGE;
}

tlb_flush_mmu(struct mmu_gather *tlb)
{
  if (!tlb->fullmm && tlb->need_flush) {
    flush_tlb_range(tlb->mm, tlb->start, tlb->end, tlb->flush_type);	
    tlb->start = TASK_SIZE;
    tlb->end = 0;
  }
  ...
}

> > There are various reasons that we need to flush TLBs _after_ freeing
> > the page-tables themselves. For some architectures (x86 among others)
> > this serializes against (both hardware and software) page table
> > walkers like gup_fast().
> 
> This part of the changelog also makes no sense what-so-ever. It's
> actively wrong.
> 
> On x86, we absolutely *must* do the TLB flush _before_ we release the
> page tables. So your commentary is actively wrong and misleading.
> 
> The order has to be:
>  - clear the page table entry, queue the page to be free'd
>  - flush the TLB
>  - free the page (and page tables)
> 
> and nothing else is correct, afaik. So the changelog is pure and utter
> garbage. I didn't look at what the patch actually changed.

OK, so I use the wrong terms, I meant page-table tear-down, where we
remove the pte page pointer from the pmd, remove the pmd page from the
pud etc.

We then flush the TLBs and only then actually free the pages. I think
the confusion stems from the fact that we call tear-down free_pgtables()

The point was that we need to TLB flush _after_ tear-down (before actual
free), not before tear-down. The problem is that currently we either end
up doing too many TLB flushes or one too few.




WARNING: multiple messages have this Message-ID (diff)
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Rik van Riel <riel@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-mm@kvack.org,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	David Miller <davem@davemloft.net>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Mel Gorman <mel@csn.ul.ie>, Nick Piggin <npiggin@kernel.dk>,
	Russell King <rmk@arm.linux.org.uk>,
	Chris Metcalf <cmetcalf@tilera.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>
Subject: Re: [RFC][PATCH 2/6] mm: Change flush_tlb_range() to take an mm_struct
Date: Wed, 02 Mar 2011 22:40:27 +0100	[thread overview]
Message-ID: <1299102027.1310.39.camel@laptop> (raw)
In-Reply-To: <AANLkTimhWKhHojZ-9XZGSh3OzfPhvo__Dib9VfeMWoBQ@mail.gmail.com>

On Wed, 2011-03-02 at 11:19 -0800, Linus Torvalds wrote:
> On Wed, Mar 2, 2011 at 9:59 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > In order to be able to properly support architecture that want/need to
> > support TLB range invalidation, we need to change the
> > flush_tlb_range() argument from a vm_area_struct to an mm_struct
> > because the range might very well extend past one VMA, or not have a
> > VMA at all.
> 
> I really don't think this is right. The whole "drop the icache
> information" thing is a total anti-optimization, since for some
> architectures, the icache flush is the _big_ deal. 

Right, so Tile has the I-cache flush from flush_tlb_range(), I'm not
sure if that's the right thing to do, Documentation/cachetlb.txt seems
to suggest doing it from update_mmu_cache() like things.

However, I really don't know, and would happily be explained how these
things are supposed to work. Also:

> Possibly much
> bigger than the TLB flush itself. Doing an icache flush was much more
> expensive than the TLB flush on alpha, for example (the tlb had ASI's
> etc, the icache did not).

Right, but the problem remains that we do page-table teardown without
having a vma.

Now we can re-introduce I/D variants again by assuming D-only and using
tlb_start_vma() to set a I-too bit on VM_EXEC. (this assumes the vm_args
range is non-executable -- which it had better be).

How about I do something like:

enum {
  TLB_FLUSH_I = 1,
  TLB_FLUSH_D = 2,
  TLB_FLUSH_PAGE = 4,
  TLB_FLUSH_HPAGE = 8,
};

void flush_tlb_range(struct mm_struct *mm, unsigned long start,
		     unsigned long end, unsigned int flags);

And we then do:

tlb_gather_mmu(struct mmu_gather *tlb, ...)
{
  ...
  tlb->flush_type = TLB_FLUSH_D | TLB_FLUSH_PAGE;
}

tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
  if (!tlb->fullmm)
    flush_cache_range(vma, vma->vm_start, vma->vm_end);

  if (vma->vm_flags & VM_EXEC)
    tlb->flush_type |= TLB_FLUSH_I;

  if (vma->vm_flags & VM_HUGEPAGE)
    tlb->flush_type |= TLB_FLUSH_HPAGE;
}

tlb_flush_mmu(struct mmu_gather *tlb)
{
  if (!tlb->fullmm && tlb->need_flush) {
    flush_tlb_range(tlb->mm, tlb->start, tlb->end, tlb->flush_type);	
    tlb->start = TASK_SIZE;
    tlb->end = 0;
  }
  ...
}

> > There are various reasons that we need to flush TLBs _after_ freeing
> > the page-tables themselves. For some architectures (x86 among others)
> > this serializes against (both hardware and software) page table
> > walkers like gup_fast().
> 
> This part of the changelog also makes no sense what-so-ever. It's
> actively wrong.
> 
> On x86, we absolutely *must* do the TLB flush _before_ we release the
> page tables. So your commentary is actively wrong and misleading.
> 
> The order has to be:
>  - clear the page table entry, queue the page to be free'd
>  - flush the TLB
>  - free the page (and page tables)
> 
> and nothing else is correct, afaik. So the changelog is pure and utter
> garbage. I didn't look at what the patch actually changed.

OK, so I use the wrong terms, I meant page-table tear-down, where we
remove the pte page pointer from the pmd, remove the pmd page from the
pud etc.

We then flush the TLBs and only then actually free the pages. I think
the confusion stems from the fact that we call tear-down free_pgtables()

The point was that we need to TLB flush _after_ tear-down (before actual
free), not before tear-down. The problem is that currently we either end
up doing too many TLB flushes or one too few.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2011-03-02 21:39 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-02 17:59 [RFC][PATCH 0/6] mm: Unify TLB gather implementations Peter Zijlstra
2011-03-02 17:59 ` Peter Zijlstra
2011-03-02 17:59 ` Peter Zijlstra
2011-03-02 17:59 ` [RFC][PATCH 1/6] mm: Optimize fullmm TLB flushing Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59 ` [RFC][PATCH 2/6] mm: Change flush_tlb_range() to take an mm_struct Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 19:19   ` Linus Torvalds
2011-03-02 19:19     ` Linus Torvalds
2011-03-02 20:58     ` Rik van Riel
2011-03-02 20:58       ` Rik van Riel
2011-03-02 21:40     ` Peter Zijlstra [this message]
2011-03-02 21:40       ` Peter Zijlstra
2011-03-02 21:47       ` David Miller
2011-03-02 21:47         ` David Miller
2011-03-03 17:22         ` Chris Metcalf
2011-03-03 17:22           ` Chris Metcalf
2011-03-03 17:22           ` Chris Metcalf
2011-03-03 18:45           ` David Miller
2011-03-03 18:45             ` David Miller
2011-03-03 18:56             ` Chris Metcalf
2011-03-03 18:56               ` Chris Metcalf
2011-03-03 18:56               ` Chris Metcalf
2011-03-10 18:05           ` [PATCH] arch/tile: optimize icache flush Chris Metcalf
2011-03-10 18:05             ` Chris Metcalf
2011-03-10 18:05             ` Chris Metcalf
2011-03-10 23:19             ` Rik van Riel
2011-03-10 23:19               ` Rik van Riel
2011-03-02 17:59 ` [RFC][PATCH 3/6] mm: Provide generic range tracking and flushing Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59 ` [RFC][PATCH 4/6] arm, mm: Convert arm to generic tlb Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-09 15:16   ` Catalin Marinas
2011-03-09 15:16     ` Catalin Marinas
2011-03-09 15:19     ` Peter Zijlstra
2011-03-09 15:19       ` Peter Zijlstra
2011-03-09 15:36       ` Catalin Marinas
2011-03-09 15:36         ` Catalin Marinas
2011-03-09 15:39         ` Peter Zijlstra
2011-03-09 15:39           ` Peter Zijlstra
2011-03-09 15:48           ` Peter Zijlstra
2011-03-09 15:48             ` Peter Zijlstra
2011-03-09 16:34             ` Catalin Marinas
2011-03-09 16:34               ` Catalin Marinas
2012-05-17  3:05   ` Paul Mundt
2012-05-17  3:05     ` Paul Mundt
2012-05-17  9:30     ` Catalin Marinas
2012-05-17  9:30       ` Catalin Marinas
2012-05-17  9:39       ` Catalin Marinas
2012-05-17  9:39         ` Catalin Marinas
2012-05-17  9:51       ` Russell King
2012-05-17  9:51         ` Russell King
2012-05-17 11:28         ` Peter Zijlstra
2012-05-17 11:28           ` Peter Zijlstra
2012-05-17 12:14           ` Catalin Marinas
2012-05-17 12:14             ` Catalin Marinas
2012-05-17 16:00           ` Catalin Marinas
2012-05-17 16:00             ` Catalin Marinas
2012-05-17 16:24             ` Peter Zijlstra
2012-05-17 16:24               ` Peter Zijlstra
2012-05-17 16:33               ` Peter Zijlstra
2012-05-17 16:33                 ` Peter Zijlstra
2012-05-17 16:44                 ` Peter Zijlstra
2012-05-17 16:44                   ` Peter Zijlstra
2012-05-17 16:59                   ` Peter Zijlstra
2012-05-17 16:59                     ` Peter Zijlstra
2012-05-17 17:01                   ` Catalin Marinas
2012-05-17 17:01                     ` Catalin Marinas
2012-05-17 17:11                     ` Peter Zijlstra
2012-05-17 17:11                       ` Peter Zijlstra
2012-05-21  7:47               ` Martin Schwidefsky
2012-05-21  7:47                 ` Martin Schwidefsky
2012-05-17 17:22             ` Russell King
2012-05-17 17:22               ` Russell King
2012-05-17 18:31               ` Catalin Marinas
2012-05-17 18:31                 ` Catalin Marinas
2011-03-02 17:59 ` [RFC][PATCH 5/6] ia64, mm: Convert ia64 " Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59 ` [RFC][PATCH 6/6] sh, mm: Convert sh " Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra
2011-03-02 17:59   ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1299102027.1310.39.camel@laptop \
    --to=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=benh@kernel.crashing.org \
    --cc=cmetcalf@tilera.com \
    --cc=davem@davemloft.net \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=mingo@elte.hu \
    --cc=npiggin@kernel.dk \
    --cc=riel@redhat.com \
    --cc=rmk@arm.linux.org.uk \
    --cc=schwidefsky@de.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.